Re: [PATCH net-next 0/7] tcp: second round for EDT conversion

2018-10-15 Thread David Miller
From: Eric Dumazet 
Date: Mon, 15 Oct 2018 09:37:51 -0700

> The first round of EDT patches left the TCP stack in a non-optimal state.
> 
> - High speed flows suffered from loss of performance, addressed
>   by the first patch of this series.
> 
> - Second patch brings pacing to the current state of networking,
>   since we now reach ~100 Gbit on a single TCP flow.
> 
> - Third patch implements a mitigation for scheduling delays,
>   like the one we did in sch_fq in the past.
> 
> - Fourth patch removes one special case in sch_fq for ACK packets.
> 
> - Fifth patch removes a serious performance cost for TCP internal
>   pacing. We should set up the high resolution timer only if
>   really needed.
> 
> - Sixth patch fixes a typo in BBR.
> 
> - Last patch is one minor change in cdg congestion control.
> 
> Neal Cardwell also has a patch series fixing BBR after
> EDT adoption.

Series applied, thanks Eric.


Re: [PATCH net] sctp: use the pmtu from the icmp packet to update transport pathmtu

2018-10-15 Thread David Miller
From: Xin Long 
Date: Mon, 15 Oct 2018 19:58:29 +0800

> Besides syncing the asoc pmtu from all transports, sctp_assoc_sync_pmtu
> also processes transport pmtu_pending set by icmp packets. But it's
> meaningless to use sctp_dst_mtu(t->dst) as the new pmtu for a transport.
> 
> The right pmtu value should come from the icmp packet, and it would
> be saved into transport->mtu_info in this patch and used later when
> the pmtu sync happens in sctp_sendmsg_to_asoc or sctp_packet_config.
> 
> Besides, without this patch, as the pmtu can only be updated correctly
> when an icmp packet is received and no place is holding the sock lock,
> it can take a long time if the sock is busy sending packets.
> 
> Note that it doesn't process transport->mtu_info in .release_cb(),
> as there is not enough information there for the pmtu update, such as
> which asoc or transport it applies to. It is not worth traversing all
> asocs to check pmtu_pending. So unlike tcp, sctp does this in the tx
> path, for which mtu_info needs to be atomic_t.
> 
> Signed-off-by: Xin Long 

Applied.


Re: [PATCH net,stable 1/1] net: fec: don't dump RX FIFO register when not available

2018-10-15 Thread David Miller
From: Andy Duan 
Date: Mon, 15 Oct 2018 05:19:00 +

> From: Fugang Duan 
> 
> Commit db65f35f50e0 ("net: fec: add support of ethtool get_regs") introduced
> the ethtool "--register-dump" interface to dump all FEC registers.
> 
> But not all silicon implementations of the Freescale FEC hardware module
> have the FRBR (FIFO Receive Bound Register) and FRSR (FIFO Receive Start
> Register) register, so we should not be trying to dump them on those that
> don't.
> 
> To fix it we create a quirk flag, FEC_QUIRK_HAS_RFREG, and check it before
> dumping those RX FIFO registers.
> 
> Signed-off-by: Fugang Duan 

Applied and queued up for -stable.


[PATCH bpf-next] libbpf: Per-symbol visibility for DSO

2018-10-15 Thread Andrey Ignatov
Make global symbols in libbpf DSO hidden by default with
-fvisibility=hidden and export symbols that are part of ABI explicitly
with __attribute__((visibility("default"))).

This is common practice that should prevent accidentally exporting a
symbol that is not supposed to be part of the ABI, which, in turn,
improves both the libbpf developer and user experience. See [1] for
more details.

Export control becomes more important since more and more projects use
libbpf.

The patch doesn't export a bunch of netlink related functions since as
agreed in [2] they'll be reworked. That doesn't break bpftool since
bpftool links libbpf statically.

[1] https://www.akkadia.org/drepper/dsohowto.pdf (2.2 Export Control)
[2] https://www.mail-archive.com/netdev@vger.kernel.org/msg251434.html

Signed-off-by: Andrey Ignatov 
---
 tools/lib/bpf/Makefile |   1 +
 tools/lib/bpf/bpf.h    | 118 ++
 tools/lib/bpf/btf.h    |  22 +++--
 tools/lib/bpf/libbpf.h | 186 ++---
 4 files changed, 179 insertions(+), 148 deletions(-)

diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index 79d84413ddf2..425b480bda75 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -125,6 +125,7 @@ override CFLAGS += $(EXTRA_WARNINGS)
 override CFLAGS += -Werror -Wall
 override CFLAGS += -fPIC
 override CFLAGS += $(INCLUDES)
+override CFLAGS += -fvisibility=hidden
 
 ifeq ($(VERBOSE),1)
   Q =
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 69a4d40c4227..258c3c178333 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -27,6 +27,10 @@
 #include 
 #include 
 
+#ifndef LIBBPF_API
+#define LIBBPF_API __attribute__((visibility("default")))
+#endif
+
 struct bpf_create_map_attr {
const char *name;
enum bpf_map_type map_type;
@@ -42,21 +46,24 @@ struct bpf_create_map_attr {
__u32 inner_map_fd;
 };
 
-int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr);
-int bpf_create_map_node(enum bpf_map_type map_type, const char *name,
-   int key_size, int value_size, int max_entries,
-   __u32 map_flags, int node);
-int bpf_create_map_name(enum bpf_map_type map_type, const char *name,
-   int key_size, int value_size, int max_entries,
-   __u32 map_flags);
-int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
-  int max_entries, __u32 map_flags);
-int bpf_create_map_in_map_node(enum bpf_map_type map_type, const char *name,
-  int key_size, int inner_map_fd, int max_entries,
-  __u32 map_flags, int node);
-int bpf_create_map_in_map(enum bpf_map_type map_type, const char *name,
- int key_size, int inner_map_fd, int max_entries,
- __u32 map_flags);
+LIBBPF_API int
+bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr);
+LIBBPF_API int bpf_create_map_node(enum bpf_map_type map_type, const char *name,
+  int key_size, int value_size,
+  int max_entries, __u32 map_flags, int node);
+LIBBPF_API int bpf_create_map_name(enum bpf_map_type map_type, const char *name,
+  int key_size, int value_size,
+  int max_entries, __u32 map_flags);
+LIBBPF_API int bpf_create_map(enum bpf_map_type map_type, int key_size,
+ int value_size, int max_entries, __u32 map_flags);
+LIBBPF_API int bpf_create_map_in_map_node(enum bpf_map_type map_type,
+ const char *name, int key_size,
+ int inner_map_fd, int max_entries,
+ __u32 map_flags, int node);
+LIBBPF_API int bpf_create_map_in_map(enum bpf_map_type map_type,
+const char *name, int key_size,
+int inner_map_fd, int max_entries,
+__u32 map_flags);
 
 struct bpf_load_program_attr {
enum bpf_prog_type prog_type;
@@ -74,44 +81,49 @@ struct bpf_load_program_attr {
 
 /* Recommend log buffer size */
 #define BPF_LOG_BUF_SIZE (256 * 1024)
-int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
-  char *log_buf, size_t log_buf_sz);
-int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
-size_t insns_cnt, const char *license,
-__u32 kern_version, char *log_buf,
-size_t log_buf_sz);
-int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
-  size_t insns_cnt, int strict_alignment,
-  const char *license, __u32 kern_version,
-  char *log_buf, size_t log_buf_sz, int log_level);
+LIBBPF_API 

Re: [PATCH -next] fore200e: fix missing unlock on error in bsq_audit()

2018-10-15 Thread David Miller
From: Wei Yongjun 
Date: Mon, 15 Oct 2018 03:07:16 +

> Add the missing unlock before return from function bsq_audit()
> in the error handling case.
> 
> Fixes: 1d9d8be91788 ("fore200e: check for dma mapping failures")
> Signed-off-by: Wei Yongjun 

Applied.


Re: [PATCH net-next 00/23] bnxt_en: Add support for new 57500 chips.

2018-10-15 Thread David Miller
From: Michael Chan 
Date: Sun, 14 Oct 2018 07:02:36 -0400

> This patch-set is larger than normal because I wanted a complete series
> to add basic support for the new 57500 chips.  The new chips have the
> following main differences compared to legacy chips:
> 
> 1. Requires the PF driver to allocate DMA context memory as a backing
> store.
> 2. New NQ (notification queue) for interrupt events.
> 3. One or more CP rings can be associated with an NQ.
> 4. 64-bit doorbells.
> 
> Most other structures and firmware APIs are compatible with legacy
> devices with some exceptions.  For example, ring groups are no longer
> used and RSS table format has changed.
> 
> The patch-set includes the usual firmware spec. update, some refactoring
> and restructuring, and adding the new code to add basic support for the
> new class of devices.

Looks good, series applied, thanks Michael.


Re: [PATCH net] ipv6: mcast: fix a use-after-free in inet6_mc_check

2018-10-15 Thread David Miller
From: Eric Dumazet 
Date: Fri, 12 Oct 2018 18:58:53 -0700

> syzbot found a use-after-free in inet6_mc_check [1]
> 
> The problem here is that inet6_mc_check() uses rcu
> and read_lock(&iml->sflock)
> 
> So the fact that ip6_mc_leave_src() is called under RTNL
> and the socket lock does not help us; we need to acquire
> iml->sflock in write mode.
> 
> In the future, we should convert all this stuff to RCU.
> 
> [1]
> BUG: KASAN: use-after-free in ipv6_addr_equal include/net/ipv6.h:521 [inline]
> BUG: KASAN: use-after-free in inet6_mc_check+0xae7/0xb40 net/ipv6/mcast.c:649
> Read of size 8 at addr 8801ce7f2510 by task syz-executor0/22432
 ...
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 

Applied and queued up for -stable, thanks.


Re: [PATCH net-next 0/2] selftests: pmtu: Add test choice and captures

2018-10-15 Thread David Miller
From: Stefano Brivio 
Date: Fri, 12 Oct 2018 23:54:12 +0200

> This series adds a couple of features useful for debugging: 1/2
> allows selecting single tests and 2/2 adds optional traffic
> captures.
> 
> Semantics for current invocation of test script are preserved.

M0AR SELF TESTS!

I love it.

Keep them coming.

Series applied, thanks.


Re: [PATCH net-next] r8169: simplify rtl8169_set_magic_reg

2018-10-15 Thread David Miller
From: Heiner Kallweit 
Date: Fri, 12 Oct 2018 23:23:57 +0200

> Simplify this function, no functional change intended.
> 
> Signed-off-by: Heiner Kallweit 

Applied.


Re: [PATCH net-next] r8169: remove unneeded call to netif_stop_queue in rtl8169_net_suspend

2018-10-15 Thread David Miller
From: Heiner Kallweit 
Date: Fri, 12 Oct 2018 23:30:52 +0200

> netif_device_detach() stops all tx queues already, so we don't need
> this call.
> 
> Signed-off-by: Heiner Kallweit 

Applied.


Re: [PATCH net-next] nfp: devlink port split support for 1x100G CXP NIC

2018-10-15 Thread David Miller
From: Jakub Kicinski 
Date: Fri, 12 Oct 2018 11:09:01 -0700

> From: Ryan C Goodfellow 
> 
> This commit makes it possible to use devlink to split the 100G CXP
> Netronome into two 40G interfaces. Currently when you ask for 2
> interfaces, the math in src/nfp_devlink.c:nfp_devlink_port_split
> calculates that you want 5 lanes per port because for some reason
> eth_port.port_lanes=10 (shouldn't this be 12 for CXP?). What we really
> want when asking for 2 breakout interfaces is 4 lanes per port. This
> commit makes that happen by calculating based on 8 lanes if 10 are
> present.
> 
> Signed-off-by: Ryan C Goodfellow 
> Reviewed-by: Jakub Kicinski 
> Reviewed-by: Greg Weeks 

Applied.


Re: [PATCH net-next 0/6] dpaa2-eth: code cleanup

2018-10-15 Thread David Miller
From: Ioana Ciornei 
Date: Fri, 12 Oct 2018 16:27:16 +

> There are no functional changes in this patch set, only some cleanup
> changes such as: unused parameters, uninitialized variables and
> unnecessary Kconfig dependencies.

Series applied.


Re: [PATCH net] ipv6: rate-limit probes for neighbourless routes

2018-10-15 Thread David Miller
From: Sabrina Dubroca 
Date: Fri, 12 Oct 2018 16:22:47 +0200

> When commit 270972554c91 ("[IPV6]: ROUTE: Add Router Reachability
> Probing (RFC4191).") introduced router probing, the rt6_probe() function
> required that a neighbour entry existed. This neighbour entry is used to
> record the timestamp of the last probe via the ->updated field.
> 
> Later, commit 2152caea7196 ("ipv6: Do not depend on rt->n in rt6_probe().")
> removed the requirement for a neighbour entry. Neighbourless routes skip
> the interval check and are not rate-limited.
> 
> This patch adds rate-limiting for neighbourless routes, by recording the
> timestamp of the last probe in the fib6_info itself.
> 
> Fixes: 2152caea7196 ("ipv6: Do not depend on rt->n in rt6_probe().")
> Signed-off-by: Sabrina Dubroca 
> Reviewed-by: Stefano Brivio 

Applied and queued up for -stable.


Re: [PATCH net-next 0/2] net: phy: improve and simplify state machine

2018-10-15 Thread David Miller
From: Heiner Kallweit 
Date: Thu, 11 Oct 2018 22:35:35 +0200

> Improve / simplify handling of states PHY_RUNNING and PHY_RESUMING in
> phylib state machine.

Series applied.


Re: [PATCH net-next v2] vxlan: support NTF_USE refresh of fdb entries

2018-10-15 Thread David Miller
From: Roopa Prabhu 
Date: Thu, 11 Oct 2018 12:35:13 -0700

> From: Roopa Prabhu 
> 
> This makes use of NTF_USE in vxlan driver consistent
> with bridge driver.
> 
> Signed-off-by: Roopa Prabhu 

Applied.


Re: [Patch net] llc: set SOCK_RCU_FREE in llc_sap_add_socket()

2018-10-15 Thread David Miller
From: Cong Wang 
Date: Thu, 11 Oct 2018 11:15:13 -0700

> When an llc sock is added into the sk_laddr_hash of an llc_sap,
> it is not marked with SOCK_RCU_FREE.
> 
> This means the sock could be freed while it is still being
> read by __llc_lookup_established() under the RCU read lock. The
> sock is refcounted, but with only the RCU read lock, nothing
> prevents readers from getting a zero refcnt.
> 
> Fix it by setting SOCK_RCU_FREE in llc_sap_add_socket().
> 
> Reported-by: syzbot+11e05f04c15e03be5...@syzkaller.appspotmail.com
> Signed-off-by: Cong Wang 

Applied and queued up for -stable.


Re: [PATCH net-next v7] net/ncsi: Extend NC-SI Netlink interface to allow user space to send NC-SI command

2018-10-15 Thread David Miller
From: 
Date: Thu, 11 Oct 2018 18:07:37 +

> The new command (NCSI_CMD_SEND_CMD) is added to allow a user space
> application to send NC-SI commands to the network card.
> Also, add a new attribute (NCSI_ATTR_DATA) for transferring requests
> and responses.
> 
> The workflow is as below.
> 
> Request:
> User space application
>   -> Netlink interface (msg)
>   -> new Netlink handler - ncsi_send_cmd_nl()
>   -> ncsi_xmit_cmd()
> 
> Response:
> Response received - ncsi_rcv_rsp()
>   -> internal response handler - ncsi_rsp_handler_xxx()
>   -> ncsi_rsp_handler_netlink()
>   -> ncsi_send_netlink_rsp ()
>   -> Netlink interface (msg)
>   -> user space application
> 
> Command timeout - ncsi_request_timeout()
>   -> ncsi_send_netlink_timeout ()
>   -> Netlink interface (msg with zero data length)
>   -> user space application
> 
> Error:
> Error detected
>   -> ncsi_send_netlink_err ()
>   -> Netlink interface (err msg)
>   -> user space application
> 
> 
> Signed-off-by: Justin Lee  

Applied.


Re: [PATCH net-next] net: phy: trigger state machine immediately in phy_start_machine

2018-10-15 Thread David Miller
From: Heiner Kallweit 
Date: Thu, 11 Oct 2018 19:31:47 +0200

> When starting the state machine there may be work to be done
> immediately, e.g. if the initial state is PHY_UP then the state
> machine may trigger an autonegotiation. Having said that, I see no need
> to wait a second until the state machine is run for the first time.
> 
> Signed-off-by: Heiner Kallweit 

Applied.


Re: [PATCH net-next 0/3] veth: XDP stats improvement

2018-10-15 Thread David Miller
From: Toshiaki Makita 
Date: Thu, 11 Oct 2018 18:36:47 +0900

> ndo_xdp_xmit in veth did not update packet counters as described in [1].
> Also, the current implementation only updates counters on the tx side,
> so rx side events like XDP_DROP were not collected.
> This series implements the missing accounting as well as support for
> ethtool per-queue stats in veth.
> 
> Patch 1: Update drop counter in ndo_xdp_xmit.
> Patch 2: Update packet and byte counters for all XDP path, and drop
>  counter on XDP_DROP.
> Patch 3: Support per-queue ethtool stats for XDP counters.
> 
> Note that counters are maintained on a per-queue basis for XDP but not
> otherwise (per-cpu and atomic as before). This is because 1) the tx
> path in veth is essentially lockless so we cannot update per-queue
> stats on tx, and 2) the rx path is a net core routine
> (process_backlog) which cannot update per-queue stats when XDP is
> disabled. On the other hand there are real rxqs and napi handlers for
> veth XDP, so we update per-queue stats on rx for XDP packets, and use
> them to calculate tx counters as well, contrary to the existing
> non-XDP counters.
> 
> [1] https://patchwork.ozlabs.org/cover/953071/#1967449
> 
> Signed-off-by: Toshiaki Makita 

Series applied.


Re: [pull request][net 0/3] Mellanox, mlx5 fixes 2018-10-10

2018-10-15 Thread David Miller
From: Saeed Mahameed 
Date: Wed, 10 Oct 2018 18:32:41 -0700

> This pull request includes some fixes to mlx5 driver,
> Please pull and let me know if there's any problem.

Pulled.

> For -stable v4.11:
> ('net/mlx5: Take only bit 24-26 of wqe.pftype_wq for page fault type')
> For -stable v4.17:
> ('net/mlx5: Fix memory leak when setting fpga ipsec caps')
> For -stable v4.18:
> ('net/mlx5: WQ, fixes for fragmented WQ buffers API')

Queued up.


Re: [pull request][net-next 0/7] Mellanox, mlx5e and IPoIB netlink support fixes

2018-10-15 Thread David Miller
From: Saeed Mahameed 
Date: Wed, 10 Oct 2018 18:24:37 -0700

> This series was meant to go to -rc but due to this late submission and the
> size/complexity of this patchset, I am submitting to net-next.
> 
> This series came to fix a very serious regression in the RDMA
> IPoIB netlink child creation API. The patchset contains fixes to two
> components and they must come together:
> 1) IPoIB netlink implementation, to allow allocation of the netdev to
> be done by the rtnl netdev code
> 2) mlx5e refactoring and changes to correctly initialize netdevices
> created by the rdma stack.
> 
> For more details please see tag log below.
> 
> Please pull and let me know if there's any problem.

Pulled, thanks.


Re: [PATCH net v3] net/sched: cls_api: add missing validation of netlink attributes

2018-10-15 Thread David Miller
From: Davide Caratti 
Date: Wed, 10 Oct 2018 22:00:58 +0200

> Similarly to what has been done in 8b4c3cdd9dd8 ("net: sched: Add policy
> validation for tc attributes"), fix classifier code to add validation of
> TCA_CHAIN and TCA_KIND netlink attributes.
> 
> tested with:
>  # ./tdc.py -c filter
> 
> v2: Let sch_api and cls_api share nla_policy they have in common, thanks
> to David Ahern.
> v3: Avoid EXPORT_SYMBOL(), as validation of those attributes is not done
> by TC modules, thanks to Cong Wang.
> While at it, restore the 'Delete / get qdisc' comment to its original
> position, just above the tc_get_qdisc() function prototype.
> 
> Fixes: 5bc1701881e39 ("net: sched: introduce multichain support for filters")
> Signed-off-by: Davide Caratti 

Applied and queued up for -stable.


Re: [PATCH net-next v2 0/2] FDDI: DEC FDDIcontroller 700 TURBOchannel adapter support

2018-10-15 Thread David Miller
From: "Maciej W. Rozycki" 
Date: Tue, 9 Oct 2018 23:57:36 +0100 (BST)

>  Questions, comments?  Otherwise, please apply.

Series applied, thank you.


Re: [PATCH net-next] tun: Consistently configure generic netdev params via rtnetlink

2018-10-15 Thread David Miller
From: Serhey Popovych 
Date: Tue,  9 Oct 2018 21:21:01 +0300

> Configuring generic network device parameters on tun will fail in the
> presence of the IFLA_INFO_KIND attribute in the IFLA_LINKINFO nested
> attribute, since tun_validate() always returns failure.
> 
> This can be visualized with the following ip-link(8) command sequences:
> 
>   # ip link set dev tun0 group 100
>   # ip link set dev tun0 group 100 type tun
>   RTNETLINK answers: Invalid argument
> 
> with contrast to dummy and veth drivers:
> 
>   # ip link set dev dummy0 group 100
>   # ip link set dev dummy0 type dummy
> 
>   # ip link set dev veth0 group 100
>   # ip link set dev veth0 group 100 type veth
> 
> Fix by returning zero in tun_validate() when @data is NULL, which is
> always the case since rtnl_link_ops->maxtype is zero in the tun driver.
> 
> Fixes: f019a7a594d9 ("tun: Implement ip link del tunXXX")
> Signed-off-by: Serhey Popovych 

Applied, thank you.


crash in xt_policy due to skb_dst_drop() in nf_ct_frag6_gather()

2018-10-15 Thread Maciej Żenczykowski
I believe that:

commit ad8b1ffc3efae2f65080bdb11145c87d299b8f9a
Author: Florian Westphal 
netfilter: ipv6: nf_defrag: drop skb dst before queueing

+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -618,6 +618,8 @@ int nf_ct_frag6_gather(struct net *net, struct
sk_buff *skb, u32 user)
fq->q.meat == fq->q.len &&
nf_ct_frag6_reasm(fq, skb, dev))
ret = 0;
+   else
+   skb_dst_drop(skb);

 out_unlock:
	spin_unlock_bh(&fq->q.lock);

Is causing a crash on android after upgrading from 4.9.96 to 4.9.119

This is because the clatd ipv4-to-ipv6 translation user space daemon is
functionally equivalent to the syzkaller reproducer:
it converts the ipv4 frags it receives via tap into ipv6 frags which
it writes out via rawv6 sendmsg.

However we are also using xt_policy, after stripping cruft this is basically:

ip6tables -A OUTPUT -m policy --dir out --pol ipsec

Crash is:

match_policy_out()
const struct dst_entry *dst = skb_dst(skb); // returns NULL
if (dst->xfrm == NULL) <-- dst == NULL -> panic
[ 1136.606948] c1 2675   [] policy_mt+0x34/0x18c
[ 1136.606954] c1 2675   [] ip6t_do_table+0x280/0x684
[ 1136.606961] c1 2675   [] ip6table_filter_hook+0x20/0x28
[ 1136.606969] c1 2675   [] nf_hook_slow+0x98/0x154
[ 1136.606977] c1 2675   [] rawv6_sendmsg+0xd14/0x1520
[ 1136.606985] c1 2675   [] inet_sendmsg+0x100/0x1b0
[ 1136.606993] c1 2675   [] ___sys_sendmsg+0x2a0/0x414
[ 1136.606999] c1 2675   [] SyS_sendmsg+0x94/0xe4

Just checking for NULL in xt_policy.c:match_policy_out() and returning
0 or 1 unconditionally seems to be the wrong thing to do,
since, after all, prior to the skb_dst_drop() the skb->dst->xfrm might
not have been NULL.

Maciej Żenczykowski, Kernel Networking Developer @ Google


[PATCH v2 net-next 06/11] ipmr: Refactor mr_rtm_dumproute

2018-10-15 Thread David Ahern
From: David Ahern 

Move per-table loops from mr_rtm_dumproute to mr_table_dump and export
mr_table_dump for dumps by specific table id.

Signed-off-by: David Ahern 
---
 include/linux/mroute_base.h |  6 
 net/ipv4/ipmr_base.c| 88 -
 2 files changed, 61 insertions(+), 33 deletions(-)

diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index 6675b9f81979..db85373c8d15 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -283,6 +283,12 @@ void *mr_mfc_find_any(struct mr_table *mrt, int vifi, void *hasharg);
 
 int mr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
   struct mr_mfc *c, struct rtmsg *rtm);
+int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb,
+ struct netlink_callback *cb,
+ int (*fill)(struct mr_table *mrt, struct sk_buff *skb,
+ u32 portid, u32 seq, struct mr_mfc *c,
+ int cmd, int flags),
+ spinlock_t *lock);
 int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
 struct mr_table *(*iter)(struct net *net,
  struct mr_table *mrt),
diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c
index 1ad9aa62a97b..132dd2613ca5 100644
--- a/net/ipv4/ipmr_base.c
+++ b/net/ipv4/ipmr_base.c
@@ -268,6 +268,55 @@ int mr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
 }
 EXPORT_SYMBOL(mr_fill_mroute);
 
+int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb,
+ struct netlink_callback *cb,
+ int (*fill)(struct mr_table *mrt, struct sk_buff *skb,
+ u32 portid, u32 seq, struct mr_mfc *c,
+ int cmd, int flags),
+ spinlock_t *lock)
+{
+   unsigned int e = 0, s_e = cb->args[1];
+   unsigned int flags = NLM_F_MULTI;
+   struct mr_mfc *mfc;
+   int err;
+
+   list_for_each_entry_rcu(mfc, &mrt->mfc_cache_list, list) {
+   if (e < s_e)
+   goto next_entry;
+
+   err = fill(mrt, skb, NETLINK_CB(cb->skb).portid,
+  cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags);
+   if (err < 0)
+   goto out;
+next_entry:
+   e++;
+   }
+   e = 0;
+   s_e = 0;
+
+   spin_lock_bh(lock);
+   list_for_each_entry(mfc, &mrt->mfc_unres_queue, list) {
+   if (e < s_e)
+   goto next_entry2;
+
+   err = fill(mrt, skb, NETLINK_CB(cb->skb).portid,
+  cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags);
+   if (err < 0) {
+   spin_unlock_bh(lock);
+   goto out;
+   }
+next_entry2:
+   e++;
+   }
+   spin_unlock_bh(lock);
+   err = 0;
+   e = 0;
+
+out:
+   cb->args[1] = e;
+   return err;
+}
+
 int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
 struct mr_table *(*iter)(struct net *net,
  struct mr_table *mrt),
@@ -277,51 +326,24 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
 int cmd, int flags),
 spinlock_t *lock)
 {
-   unsigned int t = 0, e = 0, s_t = cb->args[0], s_e = cb->args[1];
+   unsigned int t = 0, s_t = cb->args[0];
struct net *net = sock_net(skb->sk);
struct mr_table *mrt;
-   struct mr_mfc *mfc;
+   int err;
 
rcu_read_lock();
for (mrt = iter(net, NULL); mrt; mrt = iter(net, mrt)) {
if (t < s_t)
goto next_table;
-   list_for_each_entry_rcu(mfc, &mrt->mfc_cache_list, list) {
-   if (e < s_e)
-   goto next_entry;
-   if (fill(mrt, skb, NETLINK_CB(cb->skb).portid,
-cb->nlh->nlmsg_seq, mfc,
-RTM_NEWROUTE, NLM_F_MULTI) < 0)
-   goto done;
-next_entry:
-   e++;
-   }
-   e = 0;
-   s_e = 0;
-
-   spin_lock_bh(lock);
-   list_for_each_entry(mfc, &mrt->mfc_unres_queue, list) {
-   if (e < s_e)
-   goto next_entry2;
-   if (fill(mrt, skb, NETLINK_CB(cb->skb).portid,
-cb->nlh->nlmsg_seq, mfc,
-RTM_NEWROUTE, NLM_F_MULTI) < 0) {
-   spin_unlock_bh(lock);
-   goto done;
-   }
-next_entry2:
-   e++;
-   }
-   spin_unlock_bh(lock);
-   e = 0;
-   s_e = 0;
+
+   

[PATCH v2 net-next 01/11] netlink: Add answer_flags to netlink_callback

2018-10-15 Thread David Ahern
From: David Ahern 

With dump filtering we need a way to ensure the NLM_F_DUMP_FILTERED
flag is set on a message back to the user if the data returned is
influenced by some input attributes. Normally this can be done as
messages are added to the skb, but if the filter results in no data
being returned, the user could be confused as to why.

This patch adds answer_flags to the netlink_callback allowing dump
handlers to set the NLM_F_DUMP_FILTERED at a minimum in the
NLMSG_DONE message ensuring the flag gets back to the user.

The netlink_callback space is initialized to 0 via a memset in
__netlink_dump_start, so init of the new answer_flags is covered.

Signed-off-by: David Ahern 
---
 include/linux/netlink.h  | 1 +
 net/netlink/af_netlink.c | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 72580f1a72a2..4da90a6ab536 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -180,6 +180,7 @@ struct netlink_callback {
u16 family;
u16 min_dump_alloc;
boolstrict_check;
+   u16 answer_flags;
unsigned intprev_seq, seq;
longargs[6];
 };
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e613a9f89600..6bb9f3cde0b0 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2257,7 +2257,8 @@ static int netlink_dump(struct sock *sk)
}
 
nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE,
-  sizeof(nlk->dump_done_errno), NLM_F_MULTI);
+  sizeof(nlk->dump_done_errno),
+  NLM_F_MULTI | cb->answer_flags);
if (WARN_ON(!nlh))
goto errout_skb;
 
-- 
2.11.0



[PATCH v2 net-next 05/11] net/mpls: Plumb support for filtering route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of routes by egress device index and
protocol. MPLS uses only a single table and route type.

Signed-off-by: David Ahern 
---
 net/mpls/af_mpls.c | 42 +-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index bfcb4759c9ee..48f4cbd9fb38 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -2067,12 +2067,35 @@ static int mpls_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
 }
 #endif
 
+static bool mpls_rt_uses_dev(struct mpls_route *rt,
+const struct net_device *dev)
+{
+   struct net_device *nh_dev;
+
+   if (rt->rt_nhn == 1) {
+   struct mpls_nh *nh = rt->rt_nh;
+
+   nh_dev = rtnl_dereference(nh->nh_dev);
+   if (dev == nh_dev)
+   return true;
+   } else {
+   for_nexthops(rt) {
+   nh_dev = rtnl_dereference(nh->nh_dev);
+   if (nh_dev == dev)
+   return true;
+   } endfor_nexthops(rt);
+   }
+
+   return false;
+}
+
 static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb)
 {
const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
struct mpls_route __rcu **platform_label;
struct fib_dump_filter filter = {};
+   unsigned int flags = NLM_F_MULTI;
size_t platform_labels;
unsigned int index;
 
@@ -2084,6 +2107,14 @@ static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb)
	err = mpls_valid_fib_dump_req(net, nlh, &filter, cb->extack);
if (err < 0)
return err;
+
+   /* for MPLS, there is only 1 table with fixed type and flags.
+* If either are set in the filter then return nothing.
+*/
+   if ((filter.table_id && filter.table_id != RT_TABLE_MAIN) ||
+   (filter.rt_type && filter.rt_type != RTN_UNICAST) ||
+filter.flags)
+   return skb->len;
}
 
index = cb->args[0];
@@ -2092,15 +2123,24 @@ static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb)
 
platform_label = rtnl_dereference(net->mpls.platform_label);
platform_labels = net->mpls.platform_labels;
+
+   if (filter.filter_set)
+   flags |= NLM_F_DUMP_FILTERED;
+
for (; index < platform_labels; index++) {
struct mpls_route *rt;
+
rt = rtnl_dereference(platform_label[index]);
if (!rt)
continue;
 
+   if ((filter.dev && !mpls_rt_uses_dev(rt, filter.dev)) ||
+   (filter.protocol && rt->rt_protocol != filter.protocol))
+   continue;
+
if (mpls_dump_route(skb, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, RTM_NEWROUTE,
-   index, rt, NLM_F_MULTI) < 0)
+   index, rt, flags) < 0)
break;
}
cb->args[0] = index;
-- 
2.11.0



[PATCH v2 net-next 11/11] net/ipv4: Bail early if user only wants prefix entries

2018-10-15 Thread David Ahern
From: David Ahern 

Unlike IPv6, IPv4 does not have routes marked with RTF_PREFIX_RT. If the
flag is set in the dump request, just return.

In the process of this change, move the CLONE check to use the new
filter flags.

Signed-off-by: David Ahern 
---
 net/ipv4/fib_frontend.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index e86ca2255181..5bf653f36911 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -886,10 +886,14 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
	err = ip_valid_fib_dump_req(net, nlh, &filter, cb);
if (err < 0)
return err;
+   } else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
+   struct rtmsg *rtm = nlmsg_data(nlh);
+
+   filter.flags = rtm->rtm_flags & (RTM_F_PREFIX | RTM_F_CLONED);
}
 
-   if (nlmsg_len(nlh) >= sizeof(struct rtmsg) &&
-   ((struct rtmsg *)nlmsg_data(nlh))->rtm_flags & RTM_F_CLONED)
+   /* fib entries are never clones and ipv4 does not use prefix flag */
+   if (filter.flags & (RTM_F_PREFIX | RTM_F_CLONED))
return skb->len;
 
if (filter.table_id) {
-- 
2.11.0



[PATCH v2 net-next 07/11] net: Plumb support for filtering ipv4 and ipv6 multicast route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of routes by egress device index and
table id. If the table id is given in the filter, lookup table and
call mr_table_dump directly for it.

Signed-off-by: David Ahern 
---
 include/linux/mroute_base.h |  7 ---
 net/ipv4/ipmr.c             | 18 +++---
 net/ipv4/ipmr_base.c        | 42 +++---
 net/ipv6/ip6mr.c            | 18 +++---
 4 files changed, 73 insertions(+), 12 deletions(-)

diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index db85373c8d15..34de06b426ef 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /**
  * struct vif_device - interface representor for multicast routing
@@ -288,7 +289,7 @@ int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb,
  int (*fill)(struct mr_table *mrt, struct sk_buff *skb,
  u32 portid, u32 seq, struct mr_mfc *c,
  int cmd, int flags),
- spinlock_t *lock);
+ spinlock_t *lock, struct fib_dump_filter *filter);
 int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
 struct mr_table *(*iter)(struct net *net,
  struct mr_table *mrt),
@@ -296,7 +297,7 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct 
netlink_callback *cb,
 struct sk_buff *skb,
 u32 portid, u32 seq, struct mr_mfc *c,
 int cmd, int flags),
-spinlock_t *lock);
+spinlock_t *lock, struct fib_dump_filter *filter);
 
 int mr_dump(struct net *net, struct notifier_block *nb, unsigned short family,
int (*rules_dump)(struct net *net,
@@ -346,7 +347,7 @@ mr_rtm_dumproute(struct sk_buff *skb, struct 
netlink_callback *cb,
 struct sk_buff *skb,
 u32 portid, u32 seq, struct mr_mfc *c,
 int cmd, int flags),
-spinlock_t *lock)
+spinlock_t *lock, struct fib_dump_filter *filter)
 {
return -EINVAL;
 }
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 44d777058960..3fa988e6a3df 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2528,18 +2528,30 @@ static int ipmr_rtm_getroute(struct sk_buff *in_skb, 
struct nlmsghdr *nlh,
 static int ipmr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb)
 {
struct fib_dump_filter filter = {};
+   int err;
 
if (cb->strict_check) {
-   int err;
-
		err = ip_valid_fib_dump_req(sock_net(skb->sk), cb->nlh,
					    &filter, cb->extack);
if (err < 0)
return err;
}
 
+   if (filter.table_id) {
+   struct mr_table *mrt;
+
+   mrt = ipmr_get_table(sock_net(skb->sk), filter.table_id);
+   if (!mrt) {
+			NL_SET_ERR_MSG(cb->extack, "ipv4: MR table does not exist");
+   return -ENOENT;
+   }
+		err = mr_table_dump(mrt, skb, cb, _ipmr_fill_mroute,
+				    &mfc_unres_lock, &filter);
+   return skb->len ? : err;
+   }
+
return mr_rtm_dumproute(skb, cb, ipmr_mr_table_iter,
-				_ipmr_fill_mroute, &mfc_unres_lock);
+				_ipmr_fill_mroute, &mfc_unres_lock, &filter);
 }
 
 static const struct nla_policy rtm_ipmr_policy[RTA_MAX + 1] = {
diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c
index 132dd2613ca5..bfe8fd04afa0 100644
--- a/net/ipv4/ipmr_base.c
+++ b/net/ipv4/ipmr_base.c
@@ -268,21 +268,45 @@ int mr_fill_mroute(struct mr_table *mrt, struct sk_buff 
*skb,
 }
 EXPORT_SYMBOL(mr_fill_mroute);
 
+static bool mr_mfc_uses_dev(const struct mr_table *mrt,
+   const struct mr_mfc *c,
+   const struct net_device *dev)
+{
+   int ct;
+
+   for (ct = c->mfc_un.res.minvif; ct < c->mfc_un.res.maxvif; ct++) {
+   if (VIF_EXISTS(mrt, ct) && c->mfc_un.res.ttls[ct] < 255) {
+   const struct vif_device *vif;
+
+			vif = &mrt->vif_table[ct];
+   if (vif->dev == dev)
+   return true;
+   }
+   }
+   return false;
+}
+
 int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb,
  struct netlink_callback *cb,
  int (*fill)(struct mr_table *mrt, struct sk_buff *skb,
  u32 portid, u32 seq, struct mr_mfc *c,
  int cmd, int flags),
- spinlock_t *lock)
+ spinlock_t *lock, struct fib_dump_filter *filter)
 {
unsigned int e = 0, 

[PATCH v2 net-next 10/11] net/ipv6: Bail early if user only wants cloned entries

2018-10-15 Thread David Ahern
From: David Ahern 

Similar to IPv4, IPv6 fib no longer contains cloned routes. If a user
requests a route dump for only cloned entries, no sense walking the FIB
and returning everything.

Signed-off-by: David Ahern 
---
 net/ipv6/ip6_fib.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 5562c77022c6..2a058b408a6a 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -586,10 +586,13 @@ static int inet6_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
struct rtmsg *rtm = nlmsg_data(nlh);
 
-   if (rtm->rtm_flags & RTM_F_PREFIX)
-   arg.filter.flags = RTM_F_PREFIX;
+   arg.filter.flags = rtm->rtm_flags & (RTM_F_PREFIX|RTM_F_CLONED);
}
 
+   /* fib entries are never clones */
+   if (arg.filter.flags & RTM_F_CLONED)
+   return skb->len;
+
w = (void *)cb->args[2];
if (!w) {
/* New dump:
-- 
2.11.0



[PATCH v2 net-next 09/11] net/mpls: Handle kernel side filtering of route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Update the dump request parsing in MPLS for the non-INET case to
enable kernel side filtering. If INET is disabled the only filters
that make sense for MPLS are protocol and nexthop device.

Signed-off-by: David Ahern 
---
 net/mpls/af_mpls.c | 33 -
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 24381696932a..7d55d4c04088 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -2044,7 +2044,9 @@ static int mpls_valid_fib_dump_req(struct net *net, const 
struct nlmsghdr *nlh,
   struct netlink_callback *cb)
 {
struct netlink_ext_ack *extack = cb->extack;
+   struct nlattr *tb[RTA_MAX + 1];
struct rtmsg *rtm;
+   int err, i;
 
if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm))) {
		NL_SET_ERR_MSG_MOD(extack, "Invalid header for FIB dump request");
@@ -2053,15 +2055,36 @@ static int mpls_valid_fib_dump_req(struct net *net, 
const struct nlmsghdr *nlh,
 
rtm = nlmsg_data(nlh);
if (rtm->rtm_dst_len || rtm->rtm_src_len  || rtm->rtm_tos   ||
-   rtm->rtm_table   || rtm->rtm_protocol || rtm->rtm_scope ||
-   rtm->rtm_type|| rtm->rtm_flags) {
+   rtm->rtm_table   || rtm->rtm_scope|| rtm->rtm_type  ||
+   rtm->rtm_flags) {
		NL_SET_ERR_MSG_MOD(extack, "Invalid values in header for FIB dump request");
return -EINVAL;
}
 
-   if (nlmsg_attrlen(nlh, sizeof(*rtm))) {
-		NL_SET_ERR_MSG_MOD(extack, "Invalid data after header in FIB dump request");
-   return -EINVAL;
+   if (rtm->rtm_protocol) {
+   filter->protocol = rtm->rtm_protocol;
+   filter->filter_set = 1;
+   cb->answer_flags = NLM_F_DUMP_FILTERED;
+   }
+
+   err = nlmsg_parse_strict(nlh, sizeof(*rtm), tb, RTA_MAX,
+rtm_mpls_policy, extack);
+   if (err < 0)
+   return err;
+
+   for (i = 0; i <= RTA_MAX; ++i) {
+   int ifindex;
+
+   if (i == RTA_OIF) {
+   ifindex = nla_get_u32(tb[i]);
+   filter->dev = __dev_get_by_index(net, ifindex);
+   if (!filter->dev)
+   return -ENODEV;
+   filter->filter_set = 1;
+   } else if (tb[i]) {
+			NL_SET_ERR_MSG_MOD(extack, "Unsupported attribute in dump request");
+   return -EINVAL;
+   }
}
 
return 0;
-- 
2.11.0



[PATCH v2 net-next 04/11] net/ipv6: Plumb support for filtering route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of routes by table id, egress device
index, protocol, and route type. If the table id is given in the filter,
lookup the table and call fib6_dump_table directly for it.

Move the existing route flags check for prefix only routes to the new
filter.

Signed-off-by: David Ahern 
---
 net/ipv6/ip6_fib.c | 28 ++--
 net/ipv6/route.c   | 40 
 2 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 94e61fe47ff8..a51fc357a05c 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -583,10 +583,12 @@ static int inet6_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
		err = ip_valid_fib_dump_req(net, nlh, &arg.filter, cb->extack);
if (err < 0)
return err;
-   }
+   } else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
+   struct rtmsg *rtm = nlmsg_data(nlh);
 
-   s_h = cb->args[0];
-   s_e = cb->args[1];
+   if (rtm->rtm_flags & RTM_F_PREFIX)
+   arg.filter.flags = RTM_F_PREFIX;
+   }
 
w = (void *)cb->args[2];
if (!w) {
@@ -612,6 +614,20 @@ static int inet6_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
arg.net = net;
	w->args = &arg;
 
+   if (arg.filter.table_id) {
+   tb = fib6_get_table(net, arg.filter.table_id);
+   if (!tb) {
+			NL_SET_ERR_MSG_MOD(cb->extack, "FIB table does not exist");
+   return -ENOENT;
+   }
+
+   res = fib6_dump_table(tb, skb, cb);
+   goto out;
+   }
+
+   s_h = cb->args[0];
+   s_e = cb->args[1];
+
rcu_read_lock();
for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_e = 0) {
e = 0;
@@ -621,16 +637,16 @@ static int inet6_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
goto next;
res = fib6_dump_table(tb, skb, cb);
if (res != 0)
-   goto out;
+   goto out_unlock;
 next:
e++;
}
}
-out:
+out_unlock:
rcu_read_unlock();
cb->args[1] = e;
cb->args[0] = h;
-
+out:
res = res < 0 ? res : skb->len;
if (res <= 0)
fib6_dump_end(cb);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f4e08b0689a8..9fd600e42f9d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4767,28 +4767,52 @@ static int rt6_fill_node(struct net *net, struct 
sk_buff *skb,
return -EMSGSIZE;
 }
 
+static bool fib6_info_uses_dev(const struct fib6_info *f6i,
+  const struct net_device *dev)
+{
+   if (f6i->fib6_nh.nh_dev == dev)
+   return true;
+
+   if (f6i->fib6_nsiblings) {
+   struct fib6_info *sibling, *next_sibling;
+
+   list_for_each_entry_safe(sibling, next_sibling,
+					 &f6i->fib6_siblings, fib6_siblings) {
+   if (sibling->fib6_nh.nh_dev == dev)
+   return true;
+   }
+   }
+
+   return false;
+}
+
 int rt6_dump_route(struct fib6_info *rt, void *p_arg)
 {
struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg;
	struct fib_dump_filter *filter = &arg->filter;
+   unsigned int flags = NLM_F_MULTI;
struct net *net = arg->net;
 
if (rt == net->ipv6.fib6_null_entry)
return 0;
 
-   if (nlmsg_len(arg->cb->nlh) >= sizeof(struct rtmsg)) {
-   struct rtmsg *rtm = nlmsg_data(arg->cb->nlh);
-
-   /* user wants prefix routes only */
-   if (rtm->rtm_flags & RTM_F_PREFIX &&
-   !(rt->fib6_flags & RTF_PREFIX_RT)) {
-   /* success since this is not a prefix route */
+   if ((filter->flags & RTM_F_PREFIX) &&
+   !(rt->fib6_flags & RTF_PREFIX_RT)) {
+   /* success since this is not a prefix route */
+   return 1;
+   }
+   if (filter->filter_set) {
+   if ((filter->rt_type && rt->fib6_type != filter->rt_type) ||
+   (filter->dev && !fib6_info_uses_dev(rt, filter->dev)) ||
+		    (filter->protocol && rt->fib6_protocol != filter->protocol)) {
return 1;
}
+   flags |= NLM_F_DUMP_FILTERED;
}
 
return rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
 RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid,
-arg->cb->nlh->nlmsg_seq, NLM_F_MULTI);
+arg->cb->nlh->nlmsg_seq, flags);
 }
 
 static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
-- 
2.11.0



[PATCH v2 net-next 02/11] net: Add struct for fib dump filter

2018-10-15 Thread David Ahern
From: David Ahern 

Add struct fib_dump_filter for options on limiting which routes are
returned in a dump request. The current list is table id, protocol,
route type, rtm_flags and nexthop device index. struct net is needed
to lookup the net_device from the index.

Declare the filter for each route dump handler and plumb the new
arguments from dump handlers to ip_valid_fib_dump_req.

Signed-off-by: David Ahern 
---
 include/net/ip6_route.h |  1 +
 include/net/ip_fib.h| 13 -
 net/ipv4/fib_frontend.c |  6 --
 net/ipv4/ipmr.c |  6 +-
 net/ipv6/ip6_fib.c  |  5 +++--
 net/ipv6/ip6mr.c|  5 -
 net/mpls/af_mpls.c  | 12 
 7 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index cef186dbd2ce..7ab119936e69 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -174,6 +174,7 @@ struct rt6_rtnl_dump_arg {
struct sk_buff *skb;
struct netlink_callback *cb;
struct net *net;
+   struct fib_dump_filter filter;
 };
 
 int rt6_dump_route(struct fib6_info *f6i, void *p_arg);
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 852e4ebf2209..667013bf4266 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -222,6 +222,16 @@ struct fib_table {
unsigned long   __data[0];
 };
 
+struct fib_dump_filter {
+   u32 table_id;
+   /* filter_set is an optimization that an entry is set */
+   boolfilter_set;
+   unsigned char   protocol;
+   unsigned char   rt_type;
+   unsigned intflags;
+   struct net_device   *dev;
+};
+
 int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
 struct fib_result *res, int fib_flags);
 int fib_table_insert(struct net *, struct fib_table *, struct fib_config *,
@@ -453,6 +463,7 @@ static inline void fib_proc_exit(struct net *net)
 
 u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr);
 
-int ip_valid_fib_dump_req(const struct nlmsghdr *nlh,
+int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
+ struct fib_dump_filter *filter,
  struct netlink_ext_ack *extack);
 #endif  /* _NET_FIB_H */
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 0f1beceb47d5..850850dd80e1 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -802,7 +802,8 @@ static int inet_rtm_newroute(struct sk_buff *skb, struct 
nlmsghdr *nlh,
return err;
 }
 
-int ip_valid_fib_dump_req(const struct nlmsghdr *nlh,
+int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
+ struct fib_dump_filter *filter,
  struct netlink_ext_ack *extack)
 {
struct rtmsg *rtm;
@@ -837,6 +838,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
 {
const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
+   struct fib_dump_filter filter = {};
unsigned int h, s_h;
unsigned int e = 0, s_e;
struct fib_table *tb;
@@ -844,7 +846,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
int dumped = 0, err;
 
if (cb->strict_check) {
-   err = ip_valid_fib_dump_req(nlh, cb->extack);
+		err = ip_valid_fib_dump_req(net, nlh, &filter, cb->extack);
if (err < 0)
return err;
}
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 91b0d5671649..44d777058960 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2527,9 +2527,13 @@ static int ipmr_rtm_getroute(struct sk_buff *in_skb, 
struct nlmsghdr *nlh,
 
 static int ipmr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   struct fib_dump_filter filter = {};
+
if (cb->strict_check) {
-   int err = ip_valid_fib_dump_req(cb->nlh, cb->extack);
+   int err;
 
+   err = ip_valid_fib_dump_req(sock_net(skb->sk), cb->nlh,
+					    &filter, cb->extack);
if (err < 0)
return err;
}
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 0783af11b0b7..94e61fe47ff8 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -569,17 +569,18 @@ static int inet6_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
 {
const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
+   struct rt6_rtnl_dump_arg arg = {};
unsigned int h, s_h;
unsigned int e = 0, s_e;
-   struct rt6_rtnl_dump_arg arg;
struct fib6_walker *w;
struct fib6_table *tb;
struct hlist_head *head;
int res = 0;
 
if (cb->strict_check) {
-   int err = ip_valid_fib_dump_req(nlh, cb->extack);
+   

[PATCH v2 net-next 08/11] net: Enable kernel side filtering of route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Update parsing of route dump request to enable kernel side filtering.
Allow filtering results by protocol (e.g., which routing daemon installed
the route), route type (e.g., unicast), table id and nexthop device. These
amount to the low hanging fruit, yet a huge improvement, for dumping
routes.

ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can
be used to look up the device index without taking a reference. From
there filter->dev is only used during dump loops with the lock still held.

Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results
have been filtered should no entries be returned.
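
[Archive note: a minimal userspace sketch, not part of the patch, of checking that answer flag on a returned netlink message. The NLM_F_DUMP_FILTERED value and nlmsghdr layout are taken from the uapi netlink header; the helper name is illustrative only.]

```python
import struct

NLM_F_DUMP_FILTERED = 0x20  # from include/uapi/linux/netlink.h

def dump_was_filtered(nlmsg: bytes) -> bool:
    # struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid
    _length, _type, flags, _seq, _pid = struct.unpack_from("=IHHII", nlmsg)
    return bool(flags & NLM_F_DUMP_FILTERED)

# Header with NLM_F_MULTI (0x2) and NLM_F_DUMP_FILTERED set:
hdr = struct.pack("=IHHII", 16, 24, 0x2 | NLM_F_DUMP_FILTERED, 1, 0)
print(dump_was_filtered(hdr))  # True
```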

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h|  2 +-
 net/ipv4/fib_frontend.c | 51 ++---
 net/ipv4/ipmr.c |  2 +-
 net/ipv6/ip6_fib.c  |  2 +-
 net/ipv6/ip6mr.c|  2 +-
 net/mpls/af_mpls.c  |  9 +
 6 files changed, 53 insertions(+), 15 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 1eabc9edd2b9..e8d9456bf36e 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -465,5 +465,5 @@ u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 
daddr);
 
 int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
  struct fib_dump_filter *filter,
- struct netlink_ext_ack *extack);
+ struct netlink_callback *cb);
 #endif  /* _NET_FIB_H */
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 37dc8ac366fd..e86ca2255181 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -804,9 +804,14 @@ static int inet_rtm_newroute(struct sk_buff *skb, struct 
nlmsghdr *nlh,
 
 int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
  struct fib_dump_filter *filter,
- struct netlink_ext_ack *extack)
+ struct netlink_callback *cb)
 {
+   struct netlink_ext_ack *extack = cb->extack;
+   struct nlattr *tb[RTA_MAX + 1];
struct rtmsg *rtm;
+   int err, i;
+
+   ASSERT_RTNL();
 
if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm))) {
NL_SET_ERR_MSG(extack, "Invalid header for FIB dump request");
@@ -815,8 +820,7 @@ int ip_valid_fib_dump_req(struct net *net, const struct 
nlmsghdr *nlh,
 
rtm = nlmsg_data(nlh);
if (rtm->rtm_dst_len || rtm->rtm_src_len  || rtm->rtm_tos   ||
-   rtm->rtm_table   || rtm->rtm_protocol || rtm->rtm_scope ||
-   rtm->rtm_type) {
+   rtm->rtm_scope) {
		NL_SET_ERR_MSG(extack, "Invalid values in header for FIB dump request");
return -EINVAL;
}
@@ -825,9 +829,42 @@ int ip_valid_fib_dump_req(struct net *net, const struct 
nlmsghdr *nlh,
return -EINVAL;
}
 
-   if (nlmsg_attrlen(nlh, sizeof(*rtm))) {
-		NL_SET_ERR_MSG(extack, "Invalid data after header in FIB dump request");
-   return -EINVAL;
+   filter->flags= rtm->rtm_flags;
+   filter->protocol = rtm->rtm_protocol;
+   filter->rt_type  = rtm->rtm_type;
+   filter->table_id = rtm->rtm_table;
+
+   err = nlmsg_parse_strict(nlh, sizeof(*rtm), tb, RTA_MAX,
+rtm_ipv4_policy, extack);
+   if (err < 0)
+   return err;
+
+   for (i = 0; i <= RTA_MAX; ++i) {
+   int ifindex;
+
+   if (!tb[i])
+   continue;
+
+   switch (i) {
+   case RTA_TABLE:
+   filter->table_id = nla_get_u32(tb[i]);
+   break;
+   case RTA_OIF:
+   ifindex = nla_get_u32(tb[i]);
+   filter->dev = __dev_get_by_index(net, ifindex);
+   if (!filter->dev)
+   return -ENODEV;
+   break;
+   default:
+			NL_SET_ERR_MSG(extack, "Unsupported attribute in dump request");
+   return -EINVAL;
+   }
+   }
+
+   if (filter->flags || filter->protocol || filter->rt_type ||
+   filter->table_id || filter->dev) {
+   filter->filter_set = 1;
+   cb->answer_flags = NLM_F_DUMP_FILTERED;
}
 
return 0;
@@ -846,7 +883,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
int dumped = 0, err;
 
if (cb->strict_check) {
-		err = ip_valid_fib_dump_req(net, nlh, &filter, cb->extack);
+		err = ip_valid_fib_dump_req(net, nlh, &filter, cb);
if (err < 0)
return err;
}
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 3fa988e6a3df..7a3e2acda94c 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2532,7 +2532,7 @@ static int ipmr_rtm_dumproute(struct sk_buff *skb, 

[PATCH v2 net-next 00/11] net: Kernel side filtering for route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of route dumps by protocol (e.g., which
routing daemon installed the route), route type (e.g., unicast), table
id and nexthop device.

iproute2 has been doing this filtering in userspace for years; pushing
the filters to the kernel side reduces the amount of data the kernel
sends and reduces wasted cycles on both sides processing unwanted data.
These initial options provide a huge improvement for efficiently
examining routes on large scale systems.

v2
- better handling of requests for a specific table. Rather than walking
  the hash of all tables, lookup the specific table and dump it
- refactor mr_rtm_dumproute moving the loop over the table into a
  helper that can be invoked directly
- add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure
  it is returned even when the dump returns nothing

David Ahern (11):
  netlink: Add answer_flags to netlink_callback
  net: Add struct for fib dump filter
  net/ipv4: Plumb support for filtering route dumps
  net/ipv6: Plumb support for filtering route dumps
  net/mpls: Plumb support for filtering route dumps
  ipmr: Refactor mr_rtm_dumproute
  net: Plumb support for filtering ipv4 and ipv6 multicast route dumps
  net: Enable kernel side filtering of route dumps
  net/mpls: Handle kernel side filtering of route dumps
  net/ipv6: Bail early if user only wants cloned entries
  net/ipv4: Bail early if user only wants prefix entries

 include/linux/mroute_base.h |  11 +++-
 include/linux/netlink.h |   1 +
 include/net/ip6_route.h |   1 +
 include/net/ip_fib.h|  17 --
 net/ipv4/fib_frontend.c |  76 ++
 net/ipv4/fib_trie.c |  37 +
 net/ipv4/ipmr.c |  22 ++--
 net/ipv4/ipmr_base.c| 126 
 net/ipv6/ip6_fib.c  |  34 +---
 net/ipv6/ip6mr.c|  21 ++--
 net/ipv6/route.c|  40 +++---
 net/mpls/af_mpls.c  |  92 +++-
 net/netlink/af_netlink.c|   3 +-
 13 files changed, 386 insertions(+), 95 deletions(-)

-- 
2.11.0
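
[Archive note: as an on-the-wire illustration of the filters this series handles, here is a hedged sketch, not from the series itself, that packs an RTM_GETROUTE dump request carrying an rtm_protocol filter plus an RTA_OIF attribute. Constants and struct layouts are assumed from the rtnetlink uapi headers; the helper names are illustrative.]

```python
import socket
import struct

# rtnetlink constants (values as in the uapi headers)
RTM_GETROUTE = 26
NLM_F_REQUEST = 0x01
NLM_F_DUMP = 0x300          # NLM_F_ROOT | NLM_F_MATCH
RTPROT_STATIC = 4
RTA_OIF = 4

def nlattr(rta_type, payload):
    # struct nlattr: u16 len, u16 type, payload padded to 4 bytes
    length = 4 + len(payload)
    pad = (4 - length % 4) % 4
    return struct.pack("=HH", length, rta_type) + payload + b"\0" * pad

def filtered_dump_request(seq, protocol=0, oif=0):
    # struct rtmsg: family, dst_len, src_len, tos, table, protocol,
    # scope, type, then u32 rtm_flags -- the fields read as filters here.
    rtm = struct.pack("=BBBBBBBBI",
                      socket.AF_INET, 0, 0, 0, 0, protocol, 0, 0, 0)
    payload = rtm
    if oif:
        payload += nlattr(RTA_OIF, struct.pack("=I", oif))
    # struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid
    nlh = struct.pack("=IHHII", 16 + len(payload), RTM_GETROUTE,
                      NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return nlh + payload

msg = filtered_dump_request(seq=1, protocol=RTPROT_STATIC, oif=2)
print(len(msg))  # 36: 16-byte nlmsghdr + 12-byte rtmsg + 8-byte RTA_OIF
```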



[PATCH v2 net-next 03/11] net/ipv4: Plumb support for filtering route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of routes by table id, egress device index,
protocol and route type. If the table id is given in the filter, lookup the
table and call fib_table_dump directly for it.

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h|  2 +-
 net/ipv4/fib_frontend.c | 13 -
 net/ipv4/fib_trie.c | 37 ++---
 3 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 667013bf4266..1eabc9edd2b9 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -239,7 +239,7 @@ int fib_table_insert(struct net *, struct fib_table *, 
struct fib_config *,
 int fib_table_delete(struct net *, struct fib_table *, struct fib_config *,
 struct netlink_ext_ack *extack);
 int fib_table_dump(struct fib_table *table, struct sk_buff *skb,
-  struct netlink_callback *cb);
+  struct netlink_callback *cb, struct fib_dump_filter *filter);
 int fib_table_flush(struct net *net, struct fib_table *table);
 struct fib_table *fib_trie_unmerge(struct fib_table *main_tb);
 void fib_table_flush_external(struct fib_table *table);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 850850dd80e1..37dc8ac366fd 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -855,6 +855,17 @@ static int inet_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
((struct rtmsg *)nlmsg_data(nlh))->rtm_flags & RTM_F_CLONED)
return skb->len;
 
+   if (filter.table_id) {
+   tb = fib_get_table(net, filter.table_id);
+   if (!tb) {
+		NL_SET_ERR_MSG(cb->extack, "ipv4: FIB table does not exist");
+   return -ENOENT;
+   }
+
+		err = fib_table_dump(tb, skb, cb, &filter);
+   return skb->len ? : err;
+   }
+
s_h = cb->args[0];
s_e = cb->args[1];
 
@@ -869,7 +880,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
if (dumped)
			memset(&cb->args[2], 0, sizeof(cb->args) -
			       2 * sizeof(cb->args[0]));
-   err = fib_table_dump(tb, skb, cb);
+		err = fib_table_dump(tb, skb, cb, &filter);
if (err < 0) {
if (likely(skb->len))
goto out;
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 5bc0c89e81e4..237c9f72b265 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2003,12 +2003,17 @@ void fib_free_table(struct fib_table *tb)
 }
 
 static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
-struct sk_buff *skb, struct netlink_callback *cb)
+struct sk_buff *skb, struct netlink_callback *cb,
+struct fib_dump_filter *filter)
 {
+   unsigned int flags = NLM_F_MULTI;
__be32 xkey = htonl(l->key);
struct fib_alias *fa;
int i, s_i;
 
+   if (filter->filter_set)
+   flags |= NLM_F_DUMP_FILTERED;
+
s_i = cb->args[4];
i = 0;
 
@@ -2016,25 +2021,35 @@ static int fn_trie_dump_leaf(struct key_vector *l, 
struct fib_table *tb,
	hlist_for_each_entry_rcu(fa, &l->leaf, fa_list) {
int err;
 
-   if (i < s_i) {
-   i++;
-   continue;
-   }
+   if (i < s_i)
+   goto next;
 
-   if (tb->tb_id != fa->tb_id) {
-   i++;
-   continue;
+   if (tb->tb_id != fa->tb_id)
+   goto next;
+
+   if (filter->filter_set) {
+   if (filter->rt_type && fa->fa_type != filter->rt_type)
+   goto next;
+
+   if ((filter->protocol &&
+fa->fa_info->fib_protocol != filter->protocol))
+   goto next;
+
+   if (filter->dev &&
+   !fib_info_nh_uses_dev(fa->fa_info, filter->dev))
+   goto next;
}
 
err = fib_dump_info(skb, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, RTM_NEWROUTE,
tb->tb_id, fa->fa_type,
xkey, KEYLENGTH - fa->fa_slen,
-   fa->fa_tos, fa->fa_info, NLM_F_MULTI);
+   fa->fa_tos, fa->fa_info, flags);
if (err < 0) {
cb->args[4] = i;
return err;
}
+next:
i++;
}
 
@@ -2044,7 +2059,7 @@ static int 

Re: [PATCH bpf-next v2 7/8] bpf: add tls support for testing in test_sockmap

2018-10-15 Thread Daniel Borkmann
On 10/16/2018 02:42 AM, Andrey Ignatov wrote:
> Hi Daniel and John!
> 
> Daniel Borkmann  [Fri, 2018-10-12 17:46 -0700]:
>> From: John Fastabend 
>>
>> This adds a --ktls option to test_sockmap in order to enable the
>> combination of ktls and sockmap to run, which makes for another
>> batch of 648 test cases for both in combination.
>>
>> Signed-off-by: John Fastabend 
>> Signed-off-by: Daniel Borkmann 
>> ---
>>  tools/testing/selftests/bpf/test_sockmap.c | 89 
>> ++
>>  1 file changed, 89 insertions(+)
>>
>> diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
>> b/tools/testing/selftests/bpf/test_sockmap.c
>> index ac7de38..10a5fa8 100644
>> --- a/tools/testing/selftests/bpf/test_sockmap.c
>> +++ b/tools/testing/selftests/bpf/test_sockmap.c
>> @@ -71,6 +71,7 @@ int txmsg_start;
>>  int txmsg_end;
>>  int txmsg_ingress;
>>  int txmsg_skb;
>> +int ktls;
>>  
>>  static const struct option long_options[] = {
>>  {"help",no_argument,NULL, 'h' },
>> @@ -92,6 +93,7 @@ static const struct option long_options[] = {
>>  {"txmsg_end",   required_argument,  NULL, 'e'},
>>  {"txmsg_ingress", no_argument,  &txmsg_ingress, 1 },
>>  {"txmsg_skb", no_argument,  &txmsg_skb, 1 },
>> +{"ktls", no_argument,   &ktls, 1 },
>>  {0, 0, NULL, 0 }
>>  };
>>  
>> @@ -112,6 +114,76 @@ static void usage(char *argv[])
>>  printf("\n");
>>  }
>>  
>> +#define TCP_ULP 31
>> +#define TLS_TX 1
>> +#define TLS_RX 2
>> +#include <linux/tls.h>
> 
> This breaks selftest build for me:
>   test_sockmap.c:120:23: fatal error: linux/tls.h: No such file or directory
>    #include <linux/tls.h>
>  ^
>   compilation terminated.
> 
> Should include/uapi/linux/tls.h be copied to tools/ not to depend on
> host headers?

Good point, yes, that should happen; will send a fix tomorrow morning.

Thanks,
Daniel
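
[Archive note: for context on the TLS_TX/TLS_RX setsockopt payloads discussed in this thread, a rough sketch of the 40-byte blob they expect, assuming the struct layout and values (TLS_1_2_VERSION = 0x0303, TLS_CIPHER_AES_GCM_128 = 51) in the uapi tls header. The all-zero key material is a placeholder, not usable crypto.]

```python
import struct

TLS_1_2_VERSION = 0x0303          # (3 << 8) | 3
TLS_CIPHER_AES_GCM_128 = 51

def tls12_aes_gcm_128(key=b"\0" * 16, iv=b"\0" * 8,
                      salt=b"\0" * 4, rec_seq=b"\0" * 8) -> bytes:
    # struct tls12_crypto_info_aes_gcm_128:
    #   struct tls_crypto_info { __u16 version; __u16 cipher_type; }
    #   then iv[8], key[16], salt[4], rec_seq[8]
    info = struct.pack("=HH", TLS_1_2_VERSION, TLS_CIPHER_AES_GCM_128)
    return info + iv + key + salt + rec_seq

blob = tls12_aes_gcm_128()
print(len(blob))  # 40 == sizeof(struct tls12_crypto_info_aes_gcm_128)
```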


Re: [PATCH bpf-next v2 7/8] bpf: add tls support for testing in test_sockmap

2018-10-15 Thread Andrey Ignatov
Hi Daniel and John!

Daniel Borkmann  [Fri, 2018-10-12 17:46 -0700]:
> From: John Fastabend 
> 
> This adds a --ktls option to test_sockmap in order to enable the
> combination of ktls and sockmap to run, which makes for another
> batch of 648 test cases for both in combination.
> 
> Signed-off-by: John Fastabend 
> Signed-off-by: Daniel Borkmann 
> ---
>  tools/testing/selftests/bpf/test_sockmap.c | 89 
> ++
>  1 file changed, 89 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
> b/tools/testing/selftests/bpf/test_sockmap.c
> index ac7de38..10a5fa8 100644
> --- a/tools/testing/selftests/bpf/test_sockmap.c
> +++ b/tools/testing/selftests/bpf/test_sockmap.c
> @@ -71,6 +71,7 @@ int txmsg_start;
>  int txmsg_end;
>  int txmsg_ingress;
>  int txmsg_skb;
> +int ktls;
>  
>  static const struct option long_options[] = {
>   {"help",no_argument,NULL, 'h' },
> @@ -92,6 +93,7 @@ static const struct option long_options[] = {
>   {"txmsg_end",   required_argument,  NULL, 'e'},
>   {"txmsg_ingress", no_argument,  &txmsg_ingress, 1 },
>   {"txmsg_skb", no_argument,  &txmsg_skb, 1 },
> + {"ktls", no_argument,   &ktls, 1 },
>   {0, 0, NULL, 0 }
>  };
>  
> @@ -112,6 +114,76 @@ static void usage(char *argv[])
>   printf("\n");
>  }
>  
> +#define TCP_ULP 31
> +#define TLS_TX 1
> +#define TLS_RX 2
> +#include <linux/tls.h>

This breaks selftest build for me:
  test_sockmap.c:120:23: fatal error: linux/tls.h: No such file or directory
    #include <linux/tls.h>
 ^
  compilation terminated.

Should include/uapi/linux/tls.h be copied to tools/ not to depend on
host headers?

> +
> +char *sock_to_string(int s)
> +{
> + if (s == c1)
> + return "client1";
> + else if (s == c2)
> + return "client2";
> + else if (s == s1)
> + return "server1";
> + else if (s == s2)
> + return "server2";
> + else if (s == p1)
> + return "peer1";
> + else if (s == p2)
> + return "peer2";
> + else
> + return "unknown";
> +}
> +
> +static int sockmap_init_ktls(int verbose, int s)
> +{
> + struct tls12_crypto_info_aes_gcm_128 tls_tx = {
> + .info = {
> + .version = TLS_1_2_VERSION,
> + .cipher_type = TLS_CIPHER_AES_GCM_128,
> + },
> + };
> + struct tls12_crypto_info_aes_gcm_128 tls_rx = {
> + .info = {
> + .version = TLS_1_2_VERSION,
> + .cipher_type = TLS_CIPHER_AES_GCM_128,
> + },
> + };
> + int so_buf = 6553500;
> + int err;
> +
> + err = setsockopt(s, 6, TCP_ULP, "tls", sizeof("tls"));
> + if (err) {
> + fprintf(stderr, "setsockopt: TCP_ULP(%s) failed with error %i\n", sock_to_string(s), err);
> + return -EINVAL;
> + }
> + err = setsockopt(s, SOL_TLS, TLS_TX, (void *)&tls_tx, sizeof(tls_tx));
> + if (err) {
> + fprintf(stderr, "setsockopt: TLS_TX(%s) failed with error %i\n", sock_to_string(s), err);
> + return -EINVAL;
> + }
> + err = setsockopt(s, SOL_TLS, TLS_RX, (void *)&tls_rx, sizeof(tls_rx));
> + if (err) {
> + fprintf(stderr, "setsockopt: TLS_RX(%s) failed with error %i\n", sock_to_string(s), err);
> + return -EINVAL;
> + }
> + err = setsockopt(s, SOL_SOCKET, SO_SNDBUF, &so_buf, sizeof(so_buf));
> + if (err) {
> + fprintf(stderr, "setsockopt: (%s) failed sndbuf with error %i\n", sock_to_string(s), err);
> + return -EINVAL;
> + }
> + err = setsockopt(s, SOL_SOCKET, SO_RCVBUF, &so_buf, sizeof(so_buf));
> + if (err) {
> + fprintf(stderr, "setsockopt: (%s) failed rcvbuf with error %i\n", sock_to_string(s), err);
> + return -EINVAL;
> + }
> +
> + if (verbose)
> + fprintf(stdout, "socket(%s) kTLS enabled\n", sock_to_string(s));
> + return 0;
> +}
>  static int sockmap_init_sockets(int verbose)
>  {
>   int i, err, one = 1;
> @@ -456,6 +528,21 @@ static int sendmsg_test(struct sockmap_options *opt)
>   else
>   rx_fd = p2;
>  
> + if (ktls) {
> + /* Redirecting into non-TLS socket which sends into a TLS
> +  * socket is not a valid test. So in this case lets not
> +  * enable kTLS but still run the test.
> +  */
> + if (!txmsg_redir || (txmsg_redir && txmsg_ingress)) {
> + err = sockmap_init_ktls(opt->verbose, rx_fd);
> + if (err)
> + return err;
> + }
> + err = sockmap_init_ktls(opt->verbose, c1);
> + if (err)
> + return err;
> + }
> +
>   rxpid = fork();
>   if (rxpid == 0) {
>   if (opt->drop_expected)
> @@ -907,6 +994,8 @@ static 

pull-request: bpf-next 2018-10-16

2018-10-15 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Convert BPF sockmap and kTLS to both use a new sk_msg API and enable
   sk_msg BPF integration for the latter, from Daniel and John.

2) Enable BPF syscall side to indicate for maps that they do not support
   a map lookup operation as opposed to just missing key, from Prashant.

3) Add bpftool map create command which after map creation pins the
   map into bpf fs for further processing, from Jakub.

4) Add bpftool support for attaching programs to maps allowing sock_map
   and sock_hash to be used from bpftool, from John.

5) Improve syscall BPF map update/delete path for map-in-map types to
   wait a RCU grace period for pending references to complete, from Daniel.

6) Couple of follow-up fixes for the BPF socket lookup to get it
   enabled also when IPv6 is compiled as a module, from Joe.

7) Fix a generic-XDP bug to handle the case when the Ethernet header
   was mangled and thus update skb's protocol and data, from Jesper.

8) Add a missing BTF header length check between header copies from
   user space, from Wenwen.

9) Minor fixups in libbpf to use __u32 instead u32 types and include
   proper perf_event.h uapi header instead of perf internal one, from Yonghong.

10) Allow to pass user-defined flags through EXTRA_CFLAGS and EXTRA_LDFLAGS
    to bpftool's build, from Jiri.

11) BPF kselftest tweaks to add LWTUNNEL to config fragment and to install
    with_addr.sh script from flow dissector selftest, from Anders.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!



The following changes since commit 071a234ad744ab9a1e9c948874d5f646a2964734:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next (2018-10-08 23:42:44 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to 0b592b5a01bef5416472ec610d3191e019c144a5:

  tools: bpftool: add map create command (2018-10-15 16:39:21 -0700)


Alexei Starovoitov (5):
  Merge branch 'unsupported-map-lookup'
  Merge branch 'xdp-vlan'
  Merge branch 'sockmap_and_ktls'
  Merge branch 'ipv6_sk_lookup_fixes'
  Merge branch 'bpftool_sockmap'

Anders Roxell (2):
  selftests: bpf: add config fragment LWTUNNEL
  selftests: bpf: install script with_addr.sh

Daniel Borkmann (5):
  tcp, ulp: enforce sock_owned_by_me upon ulp init and cleanup
  tcp, ulp: remove ulp bits from sockmap
  bpf, sockmap: convert to generic sk_msg interface
  tls: convert to generic sk_msg interface
  bpf, doc: add maintainers entry to related files

Daniel Colascione (1):
  bpf: wait for running BPF programs when updating map-in-map

Jakub Kicinski (1):
  tools: bpftool: add map create command

Jesper Dangaard Brouer (3):
  net: fix generic XDP to handle if eth header was mangled
  bpf: make TC vlan bpf_helpers avail to selftests
  selftests/bpf: add XDP selftests for modifying and popping VLAN headers

Jiri Olsa (2):
  bpftool: Allow to add compiler flags via EXTRA_CFLAGS variable
  bpftool: Allow add linker flags via EXTRA_LDFLAGS variable

Joe Stringer (3):
  bpf: Fix dev pointer dereference from sk_skb
  bpf: Allow sk_lookup with IPv6 module
  bpf: Fix IPv6 dport byte-order in bpf_sk_lookup

John Fastabend (5):
  tls: replace poll implementation with read hook
  tls: add bpf support to sk_msg handling
  bpf: add tls support for testing in test_sockmap
  bpf: bpftool, add support for attaching programs to maps
  bpf: bpftool, add flag to allow non-compat map definitions

Prashant Bhole (6):
  bpf: error handling when map_lookup_elem isn't supported
  bpf: return EOPNOTSUPP when map lookup isn't supported
  tools/bpf: bpftool, split the function do_dump()
  tools/bpf: bpftool, print strerror when map lookup error occurs
  selftests/bpf: test_verifier, change names of fixup maps
  selftests/bpf: test_verifier, check bpf_map_lookup_elem access in bpf prog

Wenwen Wang (1):
  bpf: btf: Fix a missing check bug

Yonghong Song (1):
  tools/bpf: use proper type and uapi perf_event.h header for libbpf

 MAINTAINERS  |   10 +
 include/linux/bpf.h  |   33 +-
 include/linux/bpf_types.h|2 +-
 include/linux/filter.h   |   21 -
 include/linux/skmsg.h|  410 
 include/net/addrconf.h   |5 +
 include/net/sock.h   |4 -
 include/net/tcp.h|   28 +-
 include/net/tls.h|   24 +-
 kernel/bpf/Makefile  

[PATCH net 3/3] nfp: flower: use offsets provided by pedit instead of index for ipv6

2018-10-15 Thread Jakub Kicinski
From: Pieter Jansen van Vuuren 

Previously when populating the set ipv6 address action, we incorrectly
made use of pedit's key index to determine which 32bit word should be
set. We now calculate which word has been selected based on the offset
provided by the pedit action.

Fixes: 354b82bb320e ("nfp: add set ipv6 source and destination address")
Signed-off-by: Pieter Jansen van Vuuren 
Reviewed-by: Jakub Kicinski 
---
 .../ethernet/netronome/nfp/flower/action.c| 26 +++
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c 
b/drivers/net/ethernet/netronome/nfp/flower/action.c
index c39d7fdf73e6..7a1e9cd9cc62 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/action.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/action.c
@@ -450,12 +450,12 @@ nfp_fl_set_ip4(const struct tc_action *action, int idx, 
u32 off,
 }
 
 static void
-nfp_fl_set_ip6_helper(int opcode_tag, int idx, __be32 exact, __be32 mask,
+nfp_fl_set_ip6_helper(int opcode_tag, u8 word, __be32 exact, __be32 mask,
  struct nfp_fl_set_ipv6_addr *ip6)
 {
-   ip6->ipv6[idx % 4].mask |= mask;
-   ip6->ipv6[idx % 4].exact &= ~mask;
-   ip6->ipv6[idx % 4].exact |= exact & mask;
+   ip6->ipv6[word].mask |= mask;
+   ip6->ipv6[word].exact &= ~mask;
+   ip6->ipv6[word].exact |= exact & mask;
 
ip6->reserved = cpu_to_be16(0);
ip6->head.jump_id = opcode_tag;
@@ -468,6 +468,7 @@ nfp_fl_set_ip6(const struct tc_action *action, int idx, u32 
off,
   struct nfp_fl_set_ipv6_addr *ip_src)
 {
__be32 exact, mask;
+   u8 word;
 
/* We are expecting tcf_pedit to return a big endian value */
mask = (__force __be32)~tcf_pedit_mask(action, idx);
@@ -476,17 +477,20 @@ nfp_fl_set_ip6(const struct tc_action *action, int idx, 
u32 off,
if (exact & ~mask)
return -EOPNOTSUPP;
 
-   if (off < offsetof(struct ipv6hdr, saddr))
+   if (off < offsetof(struct ipv6hdr, saddr)) {
return -EOPNOTSUPP;
-   else if (off < offsetof(struct ipv6hdr, daddr))
-   nfp_fl_set_ip6_helper(NFP_FL_ACTION_OPCODE_SET_IPV6_SRC, idx,
+   } else if (off < offsetof(struct ipv6hdr, daddr)) {
+   word = (off - offsetof(struct ipv6hdr, saddr)) / sizeof(exact);
+   nfp_fl_set_ip6_helper(NFP_FL_ACTION_OPCODE_SET_IPV6_SRC, word,
  exact, mask, ip_src);
-   else if (off < offsetof(struct ipv6hdr, daddr) +
-  sizeof(struct in6_addr))
-   nfp_fl_set_ip6_helper(NFP_FL_ACTION_OPCODE_SET_IPV6_DST, idx,
+   } else if (off < offsetof(struct ipv6hdr, daddr) +
+  sizeof(struct in6_addr)) {
+   word = (off - offsetof(struct ipv6hdr, daddr)) / sizeof(exact);
+   nfp_fl_set_ip6_helper(NFP_FL_ACTION_OPCODE_SET_IPV6_DST, word,
  exact, mask, ip_dst);
-   else
+   } else {
return -EOPNOTSUPP;
+   }
 
return 0;
 }
-- 
2.17.1



[PATCH net 0/3] nfp: fix pedit set action offloads

2018-10-15 Thread Jakub Kicinski
Hi,

Pieter says:

This set fixes set actions when using multiple pedit actions with
partial masks and with multiple keys per pedit action. Additionally
it fixes set ipv6 pedit action offloads when using it in combination
with other header keys.

The problem would only trigger if one combines multiple pedit actions
of the same type with partial masks, e.g.:

$ tc filter add dev netdev protocol ip parent : \
flower indev netdev \
ip_proto tcp \
action pedit ex munge \ 
ip src set 11.11.11.11 retain 65535 munge \
ip src set 22.22.22.22 retain 4294901760 pipe \
csum ip and tcp pipe \
mirred egress redirect dev netdev

Pieter Jansen van Vuuren (3):
  nfp: flower: fix pedit set actions for multiple partial masks
  nfp: flower: fix multiple keys per pedit action
  nfp: flower: use offsets provided by pedit instead of index for ipv6

 .../ethernet/netronome/nfp/flower/action.c| 51 ---
 1 file changed, 33 insertions(+), 18 deletions(-)

-- 
2.17.1



[PATCH net 1/3] nfp: flower: fix pedit set actions for multiple partial masks

2018-10-15 Thread Jakub Kicinski
From: Pieter Jansen van Vuuren 

Previously we did not correctly change headers when using multiple
pedit actions with partial masks. We now take this into account and
no longer just commit the last pedit action.

Fixes: c0b1bd9a8b8a ("nfp: add set ipv4 header action flower offload")
Signed-off-by: Pieter Jansen van Vuuren 
Reviewed-by: Jakub Kicinski 
---
 .../net/ethernet/netronome/nfp/flower/action.c| 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c 
b/drivers/net/ethernet/netronome/nfp/flower/action.c
index 46ba0cf257c6..91de7a9b0190 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/action.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/action.c
@@ -429,12 +429,14 @@ nfp_fl_set_ip4(const struct tc_action *action, int idx, 
u32 off,
 
switch (off) {
case offsetof(struct iphdr, daddr):
-   set_ip_addr->ipv4_dst_mask = mask;
-   set_ip_addr->ipv4_dst = exact;
+   set_ip_addr->ipv4_dst_mask |= mask;
+   set_ip_addr->ipv4_dst &= ~mask;
+   set_ip_addr->ipv4_dst |= exact & mask;
break;
case offsetof(struct iphdr, saddr):
-   set_ip_addr->ipv4_src_mask = mask;
-   set_ip_addr->ipv4_src = exact;
+   set_ip_addr->ipv4_src_mask |= mask;
+   set_ip_addr->ipv4_src &= ~mask;
+   set_ip_addr->ipv4_src |= exact & mask;
break;
default:
return -EOPNOTSUPP;
@@ -451,8 +453,9 @@ static void
 nfp_fl_set_ip6_helper(int opcode_tag, int idx, __be32 exact, __be32 mask,
  struct nfp_fl_set_ipv6_addr *ip6)
 {
-   ip6->ipv6[idx % 4].mask = mask;
-   ip6->ipv6[idx % 4].exact = exact;
+   ip6->ipv6[idx % 4].mask |= mask;
+   ip6->ipv6[idx % 4].exact &= ~mask;
+   ip6->ipv6[idx % 4].exact |= exact & mask;
 
ip6->reserved = cpu_to_be16(0);
ip6->head.jump_id = opcode_tag;
-- 
2.17.1



[PATCH net 2/3] nfp: flower: fix multiple keys per pedit action

2018-10-15 Thread Jakub Kicinski
From: Pieter Jansen van Vuuren 

Previously we only allowed a single header key per pedit action to
change the header. This used to result in the last header key in the
pedit action to overwrite previous headers. We now keep track of them
and allow multiple header keys per pedit action.

Fixes: c0b1bd9a8b8a ("nfp: add set ipv4 header action flower offload")
Fixes: 354b82bb320e ("nfp: add set ipv6 source and destination address")
Fixes: f8b7b0a6b113 ("nfp: add set tcp and udp header action flower offload")
Signed-off-by: Pieter Jansen van Vuuren 
Reviewed-by: Jakub Kicinski 
---
 .../net/ethernet/netronome/nfp/flower/action.c   | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c 
b/drivers/net/ethernet/netronome/nfp/flower/action.c
index 91de7a9b0190..c39d7fdf73e6 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/action.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/action.c
@@ -544,7 +544,7 @@ nfp_fl_pedit(const struct tc_action *action, struct 
tc_cls_flower_offload *flow,
struct nfp_fl_set_eth set_eth;
enum pedit_header_type htype;
int idx, nkeys, err;
-   size_t act_size;
+   size_t act_size = 0;
u32 offset, cmd;
u8 ip_proto = 0;
 
@@ -602,7 +602,9 @@ nfp_fl_pedit(const struct tc_action *action, struct 
tc_cls_flower_offload *flow,
act_size = sizeof(set_eth);
	memcpy(nfp_action, &set_eth, act_size);
*a_len += act_size;
-   } else if (set_ip_addr.head.len_lw) {
+   }
+   if (set_ip_addr.head.len_lw) {
+   nfp_action += act_size;
act_size = sizeof(set_ip_addr);
	memcpy(nfp_action, &set_ip_addr, act_size);
*a_len += act_size;
@@ -610,10 +612,12 @@ nfp_fl_pedit(const struct tc_action *action, struct 
tc_cls_flower_offload *flow,
/* Hardware will automatically fix IPv4 and TCP/UDP checksum. */
*csum_updated |= TCA_CSUM_UPDATE_FLAG_IPV4HDR |
nfp_fl_csum_l4_to_flag(ip_proto);
-   } else if (set_ip6_dst.head.len_lw && set_ip6_src.head.len_lw) {
+   }
+   if (set_ip6_dst.head.len_lw && set_ip6_src.head.len_lw) {
/* TC compiles set src and dst IPv6 address as a single action,
 * the hardware requires this to be 2 separate actions.
 */
+   nfp_action += act_size;
act_size = sizeof(set_ip6_src);
	memcpy(nfp_action, &set_ip6_src, act_size);
*a_len += act_size;
@@ -626,6 +630,7 @@ nfp_fl_pedit(const struct tc_action *action, struct 
tc_cls_flower_offload *flow,
/* Hardware will automatically fix TCP/UDP checksum. */
*csum_updated |= nfp_fl_csum_l4_to_flag(ip_proto);
} else if (set_ip6_dst.head.len_lw) {
+   nfp_action += act_size;
act_size = sizeof(set_ip6_dst);
	memcpy(nfp_action, &set_ip6_dst, act_size);
*a_len += act_size;
@@ -633,13 +638,16 @@ nfp_fl_pedit(const struct tc_action *action, struct 
tc_cls_flower_offload *flow,
/* Hardware will automatically fix TCP/UDP checksum. */
*csum_updated |= nfp_fl_csum_l4_to_flag(ip_proto);
} else if (set_ip6_src.head.len_lw) {
+   nfp_action += act_size;
act_size = sizeof(set_ip6_src);
	memcpy(nfp_action, &set_ip6_src, act_size);
*a_len += act_size;
 
/* Hardware will automatically fix TCP/UDP checksum. */
*csum_updated |= nfp_fl_csum_l4_to_flag(ip_proto);
-   } else if (set_tport.head.len_lw) {
+   }
+   if (set_tport.head.len_lw) {
+   nfp_action += act_size;
act_size = sizeof(set_tport);
	memcpy(nfp_action, &set_tport, act_size);
*a_len += act_size;
-- 
2.17.1



Re: [PATCH bpf-next v2] tools: bpftool: add map create command

2018-10-15 Thread Alexei Starovoitov
On Mon, Oct 15, 2018 at 04:30:36PM -0700, Jakub Kicinski wrote:
> Add a way of creating maps from user space.  The command takes
> as parameters most of the attributes of the map creation system
> call command.  After the map is created it's pinned to bpffs.  This makes
> it possible to easily and dynamically (without rebuilding programs)
> test various corner cases related to map creation.
> 
> Map type names are taken from bpftool's array used for printing.
> In general these days we try to make use of libbpf type names, but
> there are no map type names in libbpf as of today.
> 
> As with most features I add, the motivation is testing (offloads) :)
> 
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Quentin Monnet 

Applied, Thanks



Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-15 Thread Song Liu
On Fri, Sep 21, 2018 at 3:15 PM Alexei Starovoitov
 wrote:
>
> On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
> >
> > > I have considered adding MUNMAP to match existing MMAP, but went
> > > without it because I didn't want to introduce new bit in perf_event_attr
> > > and emit these new events in a misbalanced conditional way for prog 
> > > load/unload.
> > > Like old perf is asking kernel for mmap events via mmap bit, so prog load 
> > > events
> >
> > By prog load events you mean that old perf, having perf_event_attr.mmap = 1 
> > ||
> > perf_event_attr.mmap2 = 1 will cause the new kernel to emit
> > PERF_RECORD_MMAP records for the range of addresses that a BPF program
> > is being loaded on, right?
>
> right. it would be weird when prog load events are there, but not unload.
>
> > > will be in perf.data, but old perf report won't recognize them anyway.
> >
> > Why not? It should lookup the symbol and find it in the rb_tree of maps,
> > with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the
> > BPF core, no? It'll be an unresolved symbol, but a resolved map.
> >
> > > Whereas new perf would certainly want to catch bpf events and will set
> > > both mmap and mumap bits.
> >
> > new perf with your code will find a symbol, not a map, because your code
> > catches a special case PERF_RECORD_MMAP and instead of creating a
> > 'struct map' will create a 'struct symbol' and insert it in the kallsyms
> > 'struct map', right?
>
> right.
> bpf progs are more similar to kernel functions than to modules.
> For modules it makes sense to create a new map and insert symbols into it.
> For bpf JITed images there is no DSO to parse.
> Single bpf elf file may contain multiple bpf progs and each prog may contain
> multiple bpf functions. They will be loaded at different time and
> will have different life time.
>
> > In theory the old perf should catch the PERF_RECORD_MMAP with a string
> > in the filename part and insert a new map into the kernel mmap rb_tree,
> > and then samples would be resolved to this map, but since there is no
> > backing DSO with a symtab, it would stop at that, just stating that the
> > map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly
> > there is something in there that makes it ignore this PERF_RECORD_MMAP
> > emitted by the BPF kernel code when loading a new program.
>
> In /proc/kcore there is already a section for module range.
> Hence when perf processes bpf load/unload events the map is already created.
> Therefore the patch 3 only searches for it and inserts new symbol into it.
>
> In that sense the reuse of RECORD_MMAP event for bpf progs is indeed
> not exactly clean, since no new map is created.
> It's probably better to introduce PERF_RECORD_[INSERT|ERASE]_KSYM events ?
>
> Such event potentially can be used for offline ksym resolution.
> perf could process /proc/kallsyms during perf record and emit all of them
> as synthetic PERF_RECORD_INSERT_KSYM into perf.data, so perf report can run
> on a different server and still find the right symbols.
>
> I guess, we can do bpf specific events too and keep RECORD_MMAP as-is.
> How about single PERF_RECORD_BPF event with internal flag for load/unload ?
>
> > Right, that is another unfortunate state of affairs, kernel module
> > load/unload should already be supported, reported by the kernel via a
> > proper PERF_RECORD_MODULE_LOAD/UNLOAD
>
> I agree with Peter here. It would nice, but low priority.
> modules are mostly static. Loaded once and stay there.
>
> > There is another longstanding TODO list entry: PERF_RECORD_MMAP records
> > should include a build-id, to avoid either userspace getting confused
> > when there is an update of some mmap DSO, for long running sessions, for
> > instance, or to have to scan the just recorded perf.data file for DSOs
> > with samples to then read it from the file system (more races).
> >
> > Have you ever considered having a build-id for bpf objects that could be
> > used here?
>
> build-id concept is not applicable to bpf.
> bpf elf files on the disc don't have good correlation with what is
> running in the kernel. bpf bytestream is converted and optimized
> by the verifier. Then JITed.
> So debug info left in .o file and original bpf bytestream in .o are
> mostly useless.
> For bpf programs we have 'program tag'. It is computed over original
> bpf bytestream, so both kernel and user space can compute it.
> In libbcc we use /var/tmp/bcc/bpf_prog_TAG/ directory to store original
> source code of the program, so users looking at kernel stack traces
> with bpf_prog_TAG can find the source.
> It's similar to build-id, but not going to help perf to annotate
> actual x86 instructions inside JITed image and show src code.
> Since JIT runs in the kernel this problem cannot be solved by user space only.
> It's a difficult problem and we have a plan to tackle that,
> but it's step 2. A bunch of infra is needed on bpf side to
> preserve the 

[PATCH bpf-next v2] tools: bpftool: add map create command

2018-10-15 Thread Jakub Kicinski
Add a way of creating maps from user space.  The command takes
as parameters most of the attributes of the map creation system
call command.  After the map is created it's pinned to bpffs.  This makes
it possible to easily and dynamically (without rebuilding programs)
test various corner cases related to map creation.

Map type names are taken from bpftool's array used for printing.
In general these days we try to make use of libbpf type names, but
there are no map type names in libbpf as of today.

As with most features I add, the motivation is testing (offloads) :)

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 .../bpf/bpftool/Documentation/bpftool-map.rst |  15 ++-
 tools/bpf/bpftool/Documentation/bpftool.rst   |   4 +-
 tools/bpf/bpftool/bash-completion/bpftool |  38 +-
 tools/bpf/bpftool/common.c|  21 
 tools/bpf/bpftool/main.h  |   1 +
 tools/bpf/bpftool/map.c   | 110 +-
 6 files changed, 183 insertions(+), 6 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst 
b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index a6258bc8ec4f..3497f2d80328 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -15,13 +15,15 @@ SYNOPSIS
*OPTIONS* := { { **-j** | **--json** } [{ **-p** | **--pretty** }] | { 
**-f** | **--bpffs** } }
 
*COMMANDS* :=
-   { **show** | **list** | **dump** | **update** | **lookup** | 
**getnext** | **delete**
-   | **pin** | **help** }
+   { **show** | **list** | **create** | **dump** | **update** | **lookup** 
| **getnext**
+   | **delete** | **pin** | **help** }
 
 MAP COMMANDS
 =
 
 |  **bpftool** **map { show | list }**   [*MAP*]
+|  **bpftool** **map create** *FILE* **type** *TYPE* **key** 
*KEY_SIZE* **value** *VALUE_SIZE* \
+|  **entries** *MAX_ENTRIES* **name** *NAME* [**flags** *FLAGS*] 
[**dev** *NAME*]
 |  **bpftool** **map dump**   *MAP*
 |  **bpftool** **map update** *MAP*  **key** *DATA*   **value** 
*VALUE* [*UPDATE_FLAGS*]
 |  **bpftool** **map lookup** *MAP*  **key** *DATA*
@@ -36,6 +38,11 @@ MAP COMMANDS
 |  *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
 |  *VALUE* := { *DATA* | *MAP* | *PROG* }
 |  *UPDATE_FLAGS* := { **any** | **exist** | **noexist** }
+|  *TYPE* := { **hash** | **array** | **prog_array** | 
**perf_event_array** | **percpu_hash**
+|  | **percpu_array** | **stack_trace** | **cgroup_array** | 
**lru_hash**
+|  | **lru_percpu_hash** | **lpm_trie** | **array_of_maps** | 
**hash_of_maps**
+|  | **devmap** | **sockmap** | **cpumap** | **xskmap** | 
**sockhash**
+|  | **cgroup_storage** | **reuseport_sockarray** | 
**percpu_cgroup_storage** }
 
 DESCRIPTION
 ===
@@ -47,6 +54,10 @@ DESCRIPTION
  Output will start with map ID followed by map type and
  zero or more named attributes (depending on kernel version).
 
+   **bpftool map create** *FILE* **type** *TYPE* **key** *KEY_SIZE* 
**value** *VALUE_SIZE*  **entries** *MAX_ENTRIES* **name** *NAME* [**flags** 
*FLAGS*] [**dev** *NAME*]
+ Create a new map with given parameters and pin it to *bpffs*
+ as *FILE*.
+
**bpftool map dump***MAP*
  Dump all entries in a given *MAP*.
 
diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst 
b/tools/bpf/bpftool/Documentation/bpftool.rst
index 65488317fefa..04cd4f92ab89 100644
--- a/tools/bpf/bpftool/Documentation/bpftool.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool.rst
@@ -22,8 +22,8 @@ SYNOPSIS
| { **-j** | **--json** } [{ **-p** | **--pretty** }] }
 
*MAP-COMMANDS* :=
-   { **show** | **list** | **dump** | **update** | **lookup** | 
**getnext** | **delete**
-   | **pin** | **event_pipe** | **help** }
+   { **show** | **list** | **create** | **dump** | **update** | **lookup** 
| **getnext**
+   | **delete** | **pin** | **event_pipe** | **help** }
 
*PROG-COMMANDS* := { **show** | **list** | **dump jited** | **dump 
xlated** | **pin**
| **load** | **attach** | **detach** | **help** }
diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index ac85207cba8d..c56545e87b0d 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -387,6 +387,42 @@ _bpftool()
 ;;
 esac
 ;;
+create)
+case $prev in
+$command)
+_filedir
+return 0
+;;
+type)
+COMPREPLY=( $( compgen -W 'hash array prog_array \
+  

Re: [bpf-next PATCH v3 0/2] bpftool support for sockmap use cases

2018-10-15 Thread Alexei Starovoitov
On Mon, Oct 15, 2018 at 11:19:44AM -0700, John Fastabend wrote:
> The first patch adds support for attaching programs to maps. This is
> needed to support sock{map|hash} use from bpftool. Currently, I carry
> around custom code to do this so doing it using standard bpftool will
> be great.
> 
> The second patch adds a compat mode to ignore non-zero entries in
> the map def. This allows using bpftool with maps that have extra
> fields that the user knows can be ignored. This is needed to work
> correctly with maps being loaded by other tools or directly via
> syscalls.
> 
> v3: add bash completion and doc updates for --mapcompat

Applied, Thanks



Re: [PATCH bpf-next 05/13] bpf: get better bpf_prog ksyms based on btf func type_id

2018-10-15 Thread Martin Lau
On Fri, Oct 12, 2018 at 11:54:42AM -0700, Yonghong Song wrote:
> This patch added interface to load a program with the following
> additional information:
>. prog_btf_fd
>. func_info and func_info_len
> where func_info will provides function range and type_id
> corresponding to each function.
> 
> If verifier agrees with function range provided by the user,
> the bpf_prog ksym for each function will use the func name
> provided in the type_id, which is supposed to provide better
> encoding as it is not limited by 16 bytes program name
> limitation and this is better for bpf program which contains
> multiple subprograms.
> 
> The bpf_prog_info interface is also extended to
> return btf_id and jited_func_types, so user spaces can
> print out the function prototype for each jited function.
Some nits.

> 
> Signed-off-by: Yonghong Song 
> ---
>  include/linux/bpf.h  |  2 +
>  include/linux/bpf_verifier.h |  1 +
>  include/linux/btf.h  |  2 +
>  include/uapi/linux/bpf.h | 11 +
>  kernel/bpf/btf.c | 16 +++
>  kernel/bpf/core.c|  9 
>  kernel/bpf/syscall.c | 86 +++-
>  kernel/bpf/verifier.c| 50 +
>  8 files changed, 176 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9b558713447f..e9c63ffa01af 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -308,6 +308,8 @@ struct bpf_prog_aux {
>   void *security;
>  #endif
>   struct bpf_prog_offload *offload;
> + struct btf *btf;
> + u32 type_id; /* type id for this prog/func */
>   union {
>   struct work_struct work;
>   struct rcu_head rcu;
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 9e8056ec20fa..e84782ec50ac 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -201,6 +201,7 @@ static inline bool bpf_verifier_log_needed(const struct 
> bpf_verifier_log *log)
>  struct bpf_subprog_info {
>   u32 start; /* insn idx of function entry point */
>   u16 stack_depth; /* max. stack depth used by this function */
> + u32 type_id; /* btf type_id for this subprog */
>  };
>  
>  /* single container for all structs
> diff --git a/include/linux/btf.h b/include/linux/btf.h
> index e076c4697049..90e91b52aa90 100644
> --- a/include/linux/btf.h
> +++ b/include/linux/btf.h
> @@ -46,5 +46,7 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, 
> void *obj,
>  struct seq_file *m);
>  int btf_get_fd_by_id(u32 id);
>  u32 btf_id(const struct btf *btf);
> +bool is_btf_func_type(const struct btf *btf, u32 type_id);
> +const char *btf_get_name_by_id(const struct btf *btf, u32 type_id);
>  
>  #endif
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f9187b41dff6..7ebbf4f06a65 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -332,6 +332,9 @@ union bpf_attr {
>* (context accesses, allowed helpers, etc).
>*/
>   __u32   expected_attach_type;
> + __u32   prog_btf_fd;/* fd pointing to BTF type data */
> + __u32   func_info_len;  /* func_info length */
> + __aligned_u64   func_info;  /* func type info */
>   };
>  
>   struct { /* anonymous struct used by BPF_OBJ_* commands */
> @@ -2585,6 +2588,9 @@ struct bpf_prog_info {
>   __u32 nr_jited_func_lens;
>   __aligned_u64 jited_ksyms;
>   __aligned_u64 jited_func_lens;
> + __u32 btf_id;
> + __u32 nr_jited_func_types;
> + __aligned_u64 jited_func_types;
>  } __attribute__((aligned(8)));
>  
>  struct bpf_map_info {
> @@ -2896,4 +2902,9 @@ struct bpf_flow_keys {
>   };
>  };
>  
> +struct bpf_func_info {
> + __u32   insn_offset;
> + __u32   type_id;
> +};
> +
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 794a185f11bf..85b8eeccddbd 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -486,6 +486,15 @@ static const struct btf_type *btf_type_by_id(const 
> struct btf *btf, u32 type_id)
>   return btf->types[type_id];
>  }
>  
> +bool is_btf_func_type(const struct btf *btf, u32 type_id)
> +{
> + const struct btf_type *type = btf_type_by_id(btf, type_id);
> +
> + if (!type || BTF_INFO_KIND(type->info) != BTF_KIND_FUNC)
> + return false;
> + return true;
> +}
Can btf_type_is_func() (from patch 2) be reused?
The btf_type_by_id() can be done by the caller.
I don't think it worths to add a similar helper
for just one user for now.

The !type check can be added to btf_type_is_func() if
it is needed.

> +
>  /*
>   * Regular int is not a bit field and it must be either
>   * u8/u16/u32/u64.
> @@ -2579,3 +2588,10 @@ u32 btf_id(const struct btf *btf)
>  {
>   return btf->id;
>  }
> +
> +const char 

Re: [PATCH bpf-next 0/2] IPv6 sk-lookup fixes

2018-10-15 Thread Alexei Starovoitov
On Mon, Oct 15, 2018 at 10:27:44AM -0700, Joe Stringer wrote:
> This series includes a couple of fixups for the IPv6 socket lookup
> helper, to make the API more consistent (always supply all arguments in
> network byte-order) and to allow its use when IPv6 is compiled as a
> module.

Applied, Thanks



Re: [PATCH bpf-next] tools: bpftool: add map create command

2018-10-15 Thread Jakub Kicinski
On Mon, 15 Oct 2018 12:58:07 -0700, Alexei Starovoitov wrote:
> > > > fprintf(stderr,
> > > > "Usage: %s %s { show | list }   [MAP]\n"
> > > > +   "   %s %s create FILE type TYPE key KEY_SIZE 
> > > > value VALUE_SIZE \\\n"
> > > > +   "  entries MAX_ENTRIES 
> > > > [name NAME] [flags FLAGS] \\\n"
> > > > +   "  [dev NAME]\n"
> > > 
> > > I suspect as soon as bpftool has an ability to create standalone maps
> > > some folks will start relying on such interface.  
> > 
> > That'd be cool, do you see any real life use cases where its useful
> > outside of corner case testing?  
> 
> In our XDP use case we have an odd protocol for different apps to share
> common prog_array that is pinned in bpffs.
> If cmdline creation of it via bpftool was available that would have been
> an option to consider. Not saying that it would have been a better option.
> Just another option.

I see, I didn't think of prog arrays.

> > > Therefore I'd like to request to make 'name' argument to be mandatory.  
> > 
> > Will do in v2!  
> 
> thx!
>  
> > > I think in the future we will require BTF to be mandatory too.
> > > We need to move towards more transparent and debuggable infra.
> > > Do you think requiring json description of key/value would be managable 
> > > to implement?
> > > Then bpftool could convert it to BTF and the map full be fully defined.
> > > I certainly understand that bpf prog can disregard the key/value layout 
> > > today,
> > > but we will make verifier to enforce that in the future too.  
> > 
> > I was hoping that we can leave BTF support as a future extension, and
> > then once we have the option for the verifier to enforce BTF (a sysctl?)
> > the bpftool map create without a BTF will get rejected as one would
> > expect.
> 
> right. something like sysctl in the future.
> 
> > IOW it's fine not to make BTF required at bpftool level and
> > leave it to system configuration.
> > 
> > I'd love to implement the BTF support right away, but I'm not sure I
> > can afford that right now time-wise.  The whole map create command is
> > pretty trivial, but for BTF we don't even have a way of dumping it
> > AFAICT.  We can pretty print values, but what is the format in which to
> > express the BTF itself?  We could do JSON, do we use an external
> > library?  Should we have a separate BTF command for that?  
> 
> I prefer standard C type description for both input and output :)
> Anyway that wasn't a request for you to do it now. More of the feature
> request for somebody to put on todo list :)

Oh, okay :)  

I will wait for John's patches to get merged and post v2, otherwise
we'd conflict on the man page.


Re: [PATCH bpf-next 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO

2018-10-15 Thread Daniel Borkmann
On 10/12/2018 08:54 PM, Yonghong Song wrote:
[...]
> +static bool btf_name_valid_identifier(const struct btf *btf, u32 offset)
> +{
> + /* offset must be valid */
> + const char *src = &btf->strings[offset];
> +
> + if (!isalpha(*src) && *src != '_')
> + return false;
> +
> + src++;
> + while (*src) {
> + if (!isalnum(*src) && *src != '_')
> + return false;
> + src++;
> + }
> +
> + return true;
> +}

Should there be an upper name length limit like KSYM_NAME_LEN? (Is it implied
by the kvmalloc() limit?)

>  static const char *btf_name_by_offset(const struct btf *btf, u32 offset)
>  {
>   if (!offset)
> @@ -747,7 +782,9 @@ static bool env_type_is_resolve_sink(const struct btf_verifier_env *env,
>   /* int, enum or void is a sink */
>   return !btf_type_needs_resolve(next_type);
>   case RESOLVE_PTR:
> - /* int, enum, void, struct or array is a sink for ptr */
> + /* int, enum, void, struct, array or func_proto is a sink
> +  * for ptr
> +  */
>   return !btf_type_is_modifier(next_type) &&
>   !btf_type_is_ptr(next_type);
>   case RESOLVE_STRUCT_OR_ARRAY:
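For reference, a standalone sketch of the identifier check under discussion — not the kernel code — with a hypothetical KSYM_NAME_LEN-style upper bound added along the lines Daniel suggests (the `NAME_MAX_LEN` constant is an assumption for illustration):

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bound, analogous to the kernel's KSYM_NAME_LEN. */
#define NAME_MAX_LEN 128

/* Accepts [A-Za-z_][A-Za-z0-9_]* shorter than NAME_MAX_LEN chars. */
static bool name_valid_identifier(const char *src)
{
	size_t i;

	/* First character: letter or underscore (rejects ""). */
	if (!isalpha((unsigned char)src[0]) && src[0] != '_')
		return false;

	for (i = 1; src[i]; i++) {
		if (i >= NAME_MAX_LEN - 1)
			return false;
		if (!isalnum((unsigned char)src[i]) && src[i] != '_')
			return false;
	}
	return true;
}
```

The kernel version instead takes an offset into the BTF string section; the per-character logic is the same.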


Re: Fw: [Bug 201423] New: eth0: hw csum failure

2018-10-15 Thread Fabio Rossi



On 15 October 2018 17:41:47 CEST, Eric Dumazet  wrote:
>On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger
> wrote:
>>
>>
>>
>> Begin forwarded message:
>>
>> Date: Sun, 14 Oct 2018 10:42:48 +
>> From: bugzilla-dae...@bugzilla.kernel.org
>> To: step...@networkplumber.org
>> Subject: [Bug 201423] New: eth0: hw csum failure
>>
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=201423
>>
>> Bug ID: 201423
>>Summary: eth0: hw csum failure
>>Product: Networking
>>Version: 2.5
>> Kernel Version: 4.19.0-rc7
>>   Hardware: Intel
>> OS: Linux
>>   Tree: Mainline
>> Status: NEW
>>   Severity: normal
>>   Priority: P1
>>  Component: Other
>>   Assignee: step...@networkplumber.org
>>   Reporter: ross...@inwind.it
>> Regression: No
>>
>> I have a P6T DELUXE V2 motherboard and using the sky2 driver for the
>ethernet
>> ports. I get the following error message:
>>
>> [  433.727397] eth0: hw csum failure
>> [  433.727406] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7
>#19
>> [  433.727406] Hardware name: System manufacturer System Product
>Name/P6T
>> DELUXE V2, BIOS 1202 12/22/2010
>> [  433.727407] Call Trace:
>> [  433.727409]  <IRQ>
>> [  433.727415]  dump_stack+0x46/0x5b
>> [  433.727419]  __skb_checksum_complete+0xb0/0xc0
>> [  433.727423]  tcp_v4_rcv+0x528/0xb60
>> [  433.727426]  ? ipt_do_table+0x2d0/0x400
>> [  433.727429]  ip_local_deliver_finish+0x5a/0x110
>> [  433.727430]  ip_local_deliver+0xe1/0xf0
>> [  433.727431]  ? ip_sublist_rcv_finish+0x60/0x60
>> [  433.727432]  ip_rcv+0xca/0xe0
>> [  433.727434]  ? ip_rcv_finish_core.isra.0+0x300/0x300
>> [  433.727436]  __netif_receive_skb_one_core+0x4b/0x70
>> [  433.727438]  netif_receive_skb_internal+0x4e/0x130
>> [  433.727439]  napi_gro_receive+0x6a/0x80
>> [  433.727442]  sky2_poll+0x707/0xd20
>> [  433.727446]  ? rcu_check_callbacks+0x1b4/0x900
>> [  433.727447]  net_rx_action+0x237/0x380
>> [  433.727449]  __do_softirq+0xdc/0x1e0
>> [  433.727452]  irq_exit+0xa9/0xb0
>> [  433.727453]  do_IRQ+0x45/0xc0
>> [  433.727455]  common_interrupt+0xf/0xf
>> [  433.727456]  </IRQ>
>> [  433.727459] RIP: 0010:cpuidle_enter_state+0x124/0x200
>> [  433.727461] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e
>e8 d1 8f
>> ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20
><4c> 89 e1
>> 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
>> [  433.727462] RSP: :c90a3e98 EFLAGS: 0282 ORIG_RAX:
>> ffde
>> [  433.727463] RAX: 880237b1f280 RBX: 0004 RCX:
>> 001f
>> [  433.727464] RDX: 20c49ba5e353f7cf RSI: 2fe419c1 RDI:
>> 
>> [  433.727465] RBP: 880237b263a0 R08: 0714 R09:
>> 00650512105d
>> [  433.727465] R10:  R11: 0342 R12:
>> 0064fc2a8b1c
>> [  433.727466] R13: 0064fc25b35f R14: 0004 R15:
>> 8204af20
>> [  433.727468]  ? cpuidle_enter_state+0x119/0x200
>> [  433.727471]  do_idle+0x1bf/0x200
>> [  433.727473]  cpu_startup_entry+0x6a/0x70
>> [  433.727475]  start_secondary+0x17f/0x1c0
>> [  433.727476]  secondary_startup_64+0xa4/0xb0
>> [  441.662954] eth0: hw csum failure
>> [  441.662959] CPU: 4 PID: 4347 Comm: radeon_cs:0 Not tainted
>4.19.0-rc7 #19
>> [  441.662960] Hardware name: System manufacturer System Product
>Name/P6T
>> DELUXE V2, BIOS 1202 12/22/2010
>> [  441.662960] Call Trace:
>> [  441.662963]  <IRQ>
>> [  441.662968]  dump_stack+0x46/0x5b
>> [  441.662972]  __skb_checksum_complete+0xb0/0xc0
>> [  441.662975]  tcp_v4_rcv+0x528/0xb60
>> [  441.662979]  ? ipt_do_table+0x2d0/0x400
>> [  441.662981]  ip_local_deliver_finish+0x5a/0x110
>> [  441.662983]  ip_local_deliver+0xe1/0xf0
>> [  441.662985]  ? ip_sublist_rcv_finish+0x60/0x60
>> [  441.662986]  ip_rcv+0xca/0xe0
>> [  441.662988]  ? ip_rcv_finish_core.isra.0+0x300/0x300
>> [  441.662990]  __netif_receive_skb_one_core+0x4b/0x70
>> [  441.662993]  netif_receive_skb_internal+0x4e/0x130
>> [  441.662994]  napi_gro_receive+0x6a/0x80
>> [  441.662998]  sky2_poll+0x707/0xd20
>> [  441.663000]  net_rx_action+0x237/0x380
>> [  441.663002]  __do_softirq+0xdc/0x1e0
>> [  441.663005]  irq_exit+0xa9/0xb0
>> [  441.663007]  do_IRQ+0x45/0xc0
>> [  441.663009]  common_interrupt+0xf/0xf
>> [  441.663010]  </IRQ>
>> [  441.663012] RIP: 0010:merge+0x22/0xb0
>> [  441.663014] Code: c3 31 c0 c3 90 90 90 90 41 56 41 55 41 54 55 48
>89 d5 53
>> 48 89 cb 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0
><48> 85 c9
>> 74 70 48 85 d2 74 6b 49 89 fd 49 89 f6 49 89 e4 eb 14 48
>> [  441.663015] RSP: 0018:c990b988 EFLAGS: 0246 ORIG_RAX:
>> ffde
>> [  441.663017] RAX:  RBX: 88021ab2d408 RCX:
>> 88021ab2d408
>> [  441.663018] RDX: 88021ab2d388 RSI: a021c440 RDI:
>> 
>> [  441.663019] RBP: 88021ab2d388 R08: 5ecf 

Re: [PATCH bpf-next 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO

2018-10-15 Thread Daniel Borkmann
On 10/12/2018 08:54 PM, Yonghong Song wrote:
> This patch adds BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
> support to the type section. BTF_KIND_FUNC_PROTO is used
> to specify the type of a function pointer. With this,
> BTF has a complete set of C types (except float).
> 
> BTF_KIND_FUNC is used to specify the signature of a
> defined subprogram. BTF_KIND_FUNC_PROTO can be referenced
> by another type, e.g., a pointer type, and BTF_KIND_FUNC
> type cannot be referenced by another type.
> 
> For both BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO types,
> the func return type is in t->type (where t is a
> "struct btf_type" object). The func args are an array of
> u32s immediately following object "t".
> 
> As a concrete example, for the C program below,
>   $ cat test.c
>   int foo(int (*bar)(int)) { return bar(5); }
> with latest llvm trunk built with Debug mode, we have
>   $ clang -target bpf -g -O2 -mllvm -debug-only=btf -c test.c
>   Type Table:
>   [1] FUNC name_off=1 info=0x0c01 size/type=2
>   param_type=3
>   [2] INT name_off=11 info=0x0100 size/type=4
>   desc=0x0120
>   [3] PTR name_off=0 info=0x0200 size/type=4
>   [4] FUNC_PROTO name_off=0 info=0x0d01 size/type=2
>   param_type=2
> 
>   String Table:
>   0 :
>   1 : foo
>   5 : .text
>   11 : int
>   15 : test.c
>   22 : int foo(int (*bar)(int)) { return bar(5); }
> 
>   FuncInfo Table:
>   sec_name_off=5
>   insn_offset= type_id=1
> 
>   ...
> 
> (Eventually we shall have bpftool to dump btf information
>  like the above.)
> 
> Function "foo" has a FUNC type (type_id = 1).
> The parameter of "foo" has type_id 3 which is PTR->FUNC_PROTO,
> where FUNC_PROTO refers to function pointer "bar".

Should "bar" also be part of the string table (at least at some point in the
future)? IOW, if the verifier hints at an issue in the program when it, for
example, walks pointers and rewrites ctx access, then it could dump the
variable name along with it. It might also be useful in combination with
entry 22 from the string table, when annotating the source. We might need
support for variadic functions, though. How is LLVM handling the latter with
the recent BTF support?

> In FuncInfo Table, for section .text, the function,
> with to-be-determined offset (marked as ),
> has type_id=1 which refers to a FUNC type.
> This way, the function signature is
> available to both kernel and user space.
> Here, the insn offset is not available during the dump time
> as relocation is resolved pretty late in the compilation process.
> 
> Signed-off-by: Martin KaFai Lau 
> Signed-off-by: Yonghong Song 


Re: [PATCH net] sctp: use the pmtu from the icmp packet to update transport pathmtu

2018-10-15 Thread Marcelo Ricardo Leitner
On Mon, Oct 15, 2018 at 07:58:29PM +0800, Xin Long wrote:
> Other than asoc pmtu sync from all transports, sctp_assoc_sync_pmtu
> is also processing transport pmtu_pending by icmp packets. But it's
> meaningless to use sctp_dst_mtu(t->dst) as new pmtu for a transport.
> 
> The right pmtu value should come from the icmp packet, and it would
> be saved into transport->mtu_info in this patch and used later when
> the pmtu sync happens in sctp_sendmsg_to_asoc or sctp_packet_config.
> 
> Besides, without this patch, as pmtu can only be updated correctly
> when receiving a icmp packet and no place is holding sock lock, it
> will take long time if the sock is busy with sending packets.
> 
> Note that it doesn't process transport->mtu_info in .release_cb(),
> as there is no enough information for pmtu update, like for which
> asoc or transport. It is not worth traversing all asocs to check
> pmtu_pending. So unlike tcp, sctp does this in tx path, for which
> mtu_info needs to be atomic_t.
> 
> Signed-off-by: Xin Long 

Acked-by: Marcelo Ricardo Leitner 

> ---
>  include/net/sctp/structs.h | 2 ++
>  net/sctp/associola.c   | 3 ++-
>  net/sctp/input.c   | 1 +
>  net/sctp/output.c  | 6 ++
>  4 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index 28a7c8e..a11f937 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -876,6 +876,8 @@ struct sctp_transport {
>   unsigned long sackdelay;
>   __u32 sackfreq;
>  
> + atomic_t mtu_info;
> +
>   /* When was the last time that we heard from this transport? We use
>* this to pick new active and retran paths.
>*/
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 297d9cf..a827a1f 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -1450,7 +1450,8 @@ void sctp_assoc_sync_pmtu(struct sctp_association *asoc)
>   /* Get the lowest pmtu of all the transports. */
>   list_for_each_entry(t, &asoc->peer.transport_addr_list, transports) {
>   if (t->pmtu_pending && t->dst) {
> - sctp_transport_update_pmtu(t, sctp_dst_mtu(t->dst));
> + sctp_transport_update_pmtu(t,
> +atomic_read(&t->mtu_info));
>   t->pmtu_pending = 0;
>   }
>   if (!pmtu || (t->pathmtu < pmtu))
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index 9bbc5f9..5c36a99 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -395,6 +395,7 @@ void sctp_icmp_frag_needed(struct sock *sk, struct sctp_association *asoc,
>   return;
>  
>   if (sock_owned_by_user(sk)) {
> + atomic_set(&t->mtu_info, pmtu);
>   asoc->pmtu_pending = 1;
>   t->pmtu_pending = 1;
>   return;
> diff --git a/net/sctp/output.c b/net/sctp/output.c
> index 7f849b0..67939ad 100644
> --- a/net/sctp/output.c
> +++ b/net/sctp/output.c
> @@ -120,6 +120,12 @@ void sctp_packet_config(struct sctp_packet *packet, __u32 vtag,
>   sctp_assoc_sync_pmtu(asoc);
>   }
>  
> + if (asoc->pmtu_pending) {
> + if (asoc->param_flags & SPP_PMTUD_ENABLE)
> + sctp_assoc_sync_pmtu(asoc);
> + asoc->pmtu_pending = 0;
> + }
> +
>   /* If there is a prepend chunk stick it on the list before
>* any other chunks get appended.
>*/
> -- 
> 2.1.0
> 
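A minimal userspace model of the hand-off the patch implements — C11 atomics standing in for the kernel's atomic_t, and the struct/function names here are illustrative, not the actual SCTP definitions. The ICMP path cannot take the sock lock, so it publishes the new PMTU into an atomic slot and raises a pending flag; the TX path consumes both later:

```c
#include <stdatomic.h>

/* Illustrative stand-in for the relevant sctp_transport fields. */
struct transport {
	atomic_int mtu_info;	/* written from the ICMP path */
	int pmtu_pending;	/* consumed in the TX path */
	int pathmtu;		/* effective PMTU */
};

/* ICMP "frag needed" arrives while the socket is owned by the user:
 * record the value without the lock and mark it pending. */
static void icmp_frag_needed(struct transport *t, int pmtu)
{
	atomic_store(&t->mtu_info, pmtu);
	t->pmtu_pending = 1;
}

/* TX path (lock held): sync the pending PMTU before building the
 * next packet, as sctp_packet_config() now does via the pmtu sync. */
static void tx_sync_pmtu(struct transport *t)
{
	if (t->pmtu_pending) {
		t->pathmtu = atomic_load(&t->mtu_info);
		t->pmtu_pending = 0;
	}
}
```

Only `mtu_info` needs to be atomic: `pmtu_pending` is written racily but only ever forces an extra (harmless) sync, which is why the patch makes just the value atomic_t.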


Re: [PATCH iproute2] macsec: fix off-by-one when parsing attributes

2018-10-15 Thread Sabrina Dubroca
2018-10-15, 09:36:58 -0700, Stephen Hemminger wrote:
> On Fri, 12 Oct 2018 17:34:12 +0200
> Sabrina Dubroca  wrote:
> 
> > I seem to have had a massive brainfart with uses of
> > parse_rtattr_nested(). The rtattr* array must have MAX+1 elements, and
> > the call to parse_rtattr_nested must have MAX as its bound. Let's fix
> > those.
> > 
> > Fixes: b26fc590ce62 ("ip: add MACsec support")
> > Signed-off-by: Sabrina Dubroca 
> 
> Applied,
> How did it ever work??

I'm guessing it wrote over some other stack variables before their
first use. It worked without issue until the JSON patch.

Thanks,

-- 
Sabrina
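The off-by-one pattern being fixed can be sketched in isolation (a simplified stand-in for `parse_rtattr_nested()`, not the iproute2 code; the `MACSEC_ATTR_MAX` value is hypothetical): attribute types run from 0 to MAX inclusive, so the table needs MAX + 1 slots while the bound passed to the parser is MAX itself.

```c
#include <stddef.h>
#include <string.h>

#define MACSEC_ATTR_MAX 3	/* hypothetical maximum attribute type */

struct rtattr_stub { int type; };

/* Simplified model of parse_rtattr(): fills tb[0..max], so the caller
 * must size tb with max + 1 elements. */
static void parse_stub(struct rtattr_stub *tb[], int max,
		       struct rtattr_stub *attrs, size_t n)
{
	memset(tb, 0, sizeof(*tb) * (max + 1));
	for (size_t i = 0; i < n; i++)
		if (attrs[i].type <= max)	/* type == max must fit */
			tb[attrs[i].type] = &attrs[i];
}
```

Declaring the table as `tb[MAX]` (or calling the parser with `MAX + 1`) lets an attribute of type MAX write one slot past the array, which is exactly the stack corruption the fix addresses.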


Re: [PATCH bpf-next 01/13] bpf: btf: Break up btf_type_is_void()

2018-10-15 Thread Daniel Borkmann
On 10/12/2018 08:54 PM, Yonghong Song wrote:
> This patch breaks up btf_type_is_void() into
> btf_type_is_void() and btf_type_is_fwd().
> 
> It also adds btf_type_nosize() to better describe it is
> testing a type has nosize info.
> 
> Signed-off-by: Martin KaFai Lau 
> ---

Yonghong, your SoB is missing here.

Thanks,
Daniel


Re: [PATCH bpf-next 0/2] IPv6 sk-lookup fixes

2018-10-15 Thread Daniel Borkmann
On 10/15/2018 07:27 PM, Joe Stringer wrote:
> This series includes a couple of fixups for the IPv6 socket lookup
> helper, to make the API more consistent (always supply all arguments in
> network byte-order) and to allow its use when IPv6 is compiled as a
> module.
> 
> Joe Stringer (2):
>   bpf: Allow sk_lookup with IPv6 module
>   bpf: Fix IPv6 dport byte-order in bpf_sk_lookup
> 
>  include/net/addrconf.h |  5 +
>  net/core/filter.c  | 15 +--
>  net/ipv6/af_inet6.c|  1 +
>  3 files changed, 15 insertions(+), 6 deletions(-)
> 

LGTM, thanks for following up on this. Series:

Acked-by: Daniel Borkmann 


Re: [PATCH net-next 11/18] vxlan: Add netif_is_vxlan()

2018-10-15 Thread Stephen Hemminger
On Mon, 15 Oct 2018 13:30:41 -0700
Jakub Kicinski  wrote:

> On Mon, 15 Oct 2018 23:27:41 +0300, Ido Schimmel wrote:
> > On Mon, Oct 15, 2018 at 01:16:42PM -0700, Stephen Hemminger wrote:  
> > > On Mon, 15 Oct 2018 22:57:48 +0300
> > > Ido Schimmel  wrote:
> > > 
> > > > On Mon, Oct 15, 2018 at 11:57:56AM -0700, Jakub Kicinski wrote:
> > > > > On Sat, 13 Oct 2018 17:18:38 +, Ido Schimmel wrote:  
> > > > > > Add the ability to determine whether a netdev is a VxLAN netdev by
> > > > > > calling the above mentioned function that checks the netdev's 
> > > > > > private
> > > > > > flags.
> > > > > > 
> > > > > > This will allow modules to identify netdev events involving a VxLAN
> > > > > > netdev and act accordingly. For example, drivers capable of VxLAN
> > > > > > offload will need to configure the underlying device when a VxLAN 
> > > > > > netdev
> > > > > > is being enslaved to an offloaded bridge.
> > > > > > 
> > > > > > Signed-off-by: Ido Schimmel 
> > > > > > Reviewed-by: Petr Machata   
> > > > > 
> > > > > Is this preferable over
> > > > > 
> > > > > !strcmp(netdev->rtnl_link_ops->kind, "vxlan")
> > > > > 
> > > > > which is what TC offloads do?  
> > > > 
> > > > Using a flag seemed like the more standard way.
> > > > 
> > > > That being said, we considered using net_device_ops instead, given we
> > > > are about to run out of available private flags, so I don't mind
> > > > adopting a technique already employed by another driver.
> > > > 
> > > > P.S. Had to Cc netdev again. I think your client somehow messed the Cc
> > > > list? I see Cc list in your reply, but with back slashes at the end of
> > > > two email addresses.
> > > 
> > > Agree that using a global resource bit in flags is probably overkill.
> > > If you can use kind that would be good example for other drivers as well. 
> > >
> > 
> > OK, will change.
> > 
> > Jakub, any objections if I implement netif_is_vxlan() using 'kind' and
> > convert nfp to use the helper? Having all these helpers in the same
> > location will increase the chances of others reusing them.  
> 
> Sounds very good :)

We could even do this for bridge, and other devices that are using private 
flags.


Re: [PATCH net-next 11/18] vxlan: Add netif_is_vxlan()

2018-10-15 Thread Jakub Kicinski
On Mon, 15 Oct 2018 23:27:41 +0300, Ido Schimmel wrote:
> On Mon, Oct 15, 2018 at 01:16:42PM -0700, Stephen Hemminger wrote:
> > On Mon, 15 Oct 2018 22:57:48 +0300
> > Ido Schimmel  wrote:
> >   
> > > On Mon, Oct 15, 2018 at 11:57:56AM -0700, Jakub Kicinski wrote:  
> > > > On Sat, 13 Oct 2018 17:18:38 +, Ido Schimmel wrote:
> > > > > Add the ability to determine whether a netdev is a VxLAN netdev by
> > > > > calling the above mentioned function that checks the netdev's private
> > > > > flags.
> > > > > 
> > > > > This will allow modules to identify netdev events involving a VxLAN
> > > > > netdev and act accordingly. For example, drivers capable of VxLAN
> > > > > offload will need to configure the underlying device when a VxLAN 
> > > > > netdev
> > > > > is being enslaved to an offloaded bridge.
> > > > > 
> > > > > Signed-off-by: Ido Schimmel 
> > > > > Reviewed-by: Petr Machata 
> > > > 
> > > > Is this preferable over
> > > > 
> > > > !strcmp(netdev->rtnl_link_ops->kind, "vxlan")
> > > > 
> > > > which is what TC offloads do?
> > > 
> > > Using a flag seemed like the more standard way.
> > > 
> > > That being said, we considered using net_device_ops instead, given we
> > > are about to run out of available private flags, so I don't mind
> > > adopting a technique already employed by another driver.
> > > 
> > > P.S. Had to Cc netdev again. I think your client somehow messed the Cc
> > > list? I see Cc list in your reply, but with back slashes at the end of
> > > two email addresses.  
> > 
> > Agree that using a global resource bit in flags is probably overkill.
> > If you can use kind that would be good example for other drivers as well.  
> 
> OK, will change.
> 
> Jakub, any objections if I implement netif_is_vxlan() using 'kind' and
> convert nfp to use the helper? Having all these helpers in the same
> location will increase the chances of others reusing them.

Sounds very good :)


Re: [PATCH net-next 11/18] vxlan: Add netif_is_vxlan()

2018-10-15 Thread Ido Schimmel
On Mon, Oct 15, 2018 at 01:16:42PM -0700, Stephen Hemminger wrote:
> On Mon, 15 Oct 2018 22:57:48 +0300
> Ido Schimmel  wrote:
> 
> > On Mon, Oct 15, 2018 at 11:57:56AM -0700, Jakub Kicinski wrote:
> > > On Sat, 13 Oct 2018 17:18:38 +, Ido Schimmel wrote:  
> > > > Add the ability to determine whether a netdev is a VxLAN netdev by
> > > > calling the above mentioned function that checks the netdev's private
> > > > flags.
> > > > 
> > > > This will allow modules to identify netdev events involving a VxLAN
> > > > netdev and act accordingly. For example, drivers capable of VxLAN
> > > > offload will need to configure the underlying device when a VxLAN netdev
> > > > is being enslaved to an offloaded bridge.
> > > > 
> > > > Signed-off-by: Ido Schimmel 
> > > > Reviewed-by: Petr Machata   
> > > 
> > > Is this preferable over
> > > 
> > > !strcmp(netdev->rtnl_link_ops->kind, "vxlan")
> > > 
> > > which is what TC offloads do?  
> > 
> > Using a flag seemed like the more standard way.
> > 
> > That being said, we considered using net_device_ops instead, given we
> > are about to run out of available private flags, so I don't mind
> > adopting a technique already employed by another driver.
> > 
> > P.S. Had to Cc netdev again. I think your client somehow messed the Cc
> > list? I see Cc list in your reply, but with back slashes at the end of
> > two email addresses.
> 
> Agree that using a global resource bit in flags is probably overkill.
> If you can use kind that would be good example for other drivers as well.

OK, will change.

Jakub, any objections if I implement netif_is_vxlan() using 'kind' and
convert nfp to use the helper? Having all these helpers in the same
location will increase the chances of others reusing them.


[iproute PATCH] ip-addrlabel: Fix printing of label value

2018-10-15 Thread Phil Sutter
Passing the return value of RTA_DATA() to rta_getattr_u32() is wrong
since that function will call RTA_DATA() by itself already.

Fixes: a7ad1c8a6845d ("ipaddrlabel: add json support")
Signed-off-by: Phil Sutter 
---
 ip/ipaddrlabel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip/ipaddrlabel.c b/ip/ipaddrlabel.c
index 2f79c56dcead2..8abe5722bafd1 100644
--- a/ip/ipaddrlabel.c
+++ b/ip/ipaddrlabel.c
@@ -95,7 +95,7 @@ int print_addrlabel(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg
}
 
if (tb[IFAL_LABEL] && RTA_PAYLOAD(tb[IFAL_LABEL]) == sizeof(uint32_t)) {
-   uint32_t label = rta_getattr_u32(RTA_DATA(tb[IFAL_LABEL]));
+   uint32_t label = rta_getattr_u32(tb[IFAL_LABEL]);
 
print_uint(PRINT_ANY,
   "label", "label %u ", label);
-- 
2.19.0
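To illustrate the bug fixed above (using simplified stand-ins, not the actual iproute2 definitions): the getter already applies RTA_DATA() internally, so wrapping its argument in RTA_DATA() as well treats the attribute's payload as if it were another attribute header and reads past it.

```c
#include <stdint.h>
#include <string.h>

/* Simplified shape of a netlink attribute: a 4-byte header followed
 * by the payload. RTA_DATA() skips the header. */
struct rtattr_stub {
	uint16_t rta_len;
	uint16_t rta_type;
	unsigned char payload[4];
};

#define RTA_DATA_STUB(rta) ((void *)(rta)->payload)

/* Like iproute2's rta_getattr_u32(): takes the attribute itself and
 * applies RTA_DATA() internally. */
static uint32_t rta_getattr_u32_stub(const struct rtattr_stub *rta)
{
	uint32_t v;

	memcpy(&v, RTA_DATA_STUB(rta), sizeof(v));
	return v;
}
```

So the correct call is `rta_getattr_u32(tb[IFAL_LABEL])`; the removed `rta_getattr_u32(RTA_DATA(...))` form would skip the header twice and read 4 bytes of whatever follows the payload.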



Re: [PATCH net-next 11/18] vxlan: Add netif_is_vxlan()

2018-10-15 Thread Stephen Hemminger
On Mon, 15 Oct 2018 22:57:48 +0300
Ido Schimmel  wrote:

> On Mon, Oct 15, 2018 at 11:57:56AM -0700, Jakub Kicinski wrote:
> > On Sat, 13 Oct 2018 17:18:38 +, Ido Schimmel wrote:  
> > > Add the ability to determine whether a netdev is a VxLAN netdev by
> > > calling the above mentioned function that checks the netdev's private
> > > flags.
> > > 
> > > This will allow modules to identify netdev events involving a VxLAN
> > > netdev and act accordingly. For example, drivers capable of VxLAN
> > > offload will need to configure the underlying device when a VxLAN netdev
> > > is being enslaved to an offloaded bridge.
> > > 
> > > Signed-off-by: Ido Schimmel 
> > > Reviewed-by: Petr Machata   
> > 
> > Is this preferable over
> > 
> > !strcmp(netdev->rtnl_link_ops->kind, "vxlan")
> > 
> > which is what TC offloads do?  
> 
> Using a flag seemed like the more standard way.
> 
> That being said, we considered using net_device_ops instead, given we
> are about to run out of available private flags, so I don't mind
> adopting a technique already employed by another driver.
> 
> P.S. Had to Cc netdev again. I think your client somehow messed the Cc
> list? I see Cc list in your reply, but with back slashes at the end of
> two email addresses.

Agree that using a global resource bit in flags is probably overkill.
If you can use kind that would be good example for other drivers as well.


Re: [PATCH bpf-next] tools: bpftool: add map create command

2018-10-15 Thread Alexei Starovoitov
On Mon, Oct 15, 2018 at 09:49:08AM -0700, Jakub Kicinski wrote:
> On Fri, 12 Oct 2018 23:16:59 -0700, Alexei Starovoitov wrote:
> > On Fri, Oct 12, 2018 at 11:06:14AM -0700, Jakub Kicinski wrote:
> > > Add a way of creating maps from user space.  The command takes
> > > as parameters most of the attributes of the map creation system
> > > call command.  After map is created its pinned to bpffs.  This makes
> > > it possible to easily and dynamically (without rebuilding programs)
> > > test various corner cases related to map creation.
> > > 
> > > Map type names are taken from bpftool's array used for printing.
> > > In general these days we try to make use of libbpf type names, but
> > > there are no map type names in libbpf as of today.
> > > 
> > > As with most features I add the motivation is testing (offloads) :)
> > > 
> > > Signed-off-by: Jakub Kicinski 
> > > Reviewed-by: Quentin Monnet   
> > ...
> > >   fprintf(stderr,
> > >   "Usage: %s %s { show | list }   [MAP]\n"
> > > + "   %s %s create FILE type TYPE key KEY_SIZE value VALUE_SIZE \\\n"
> > > + "  entries MAX_ENTRIES [name NAME] [flags FLAGS] \\\n"
> > > + "  [dev NAME]\n"
> > 
> > I suspect as soon as bpftool has an ability to create standalone maps
> > some folks will start relying on such interface.
> 
> That'd be cool, do you see any real life use cases where its useful
> outside of corner case testing?

In our XDP use case we have an odd protocol for different apps to share
common prog_array that is pinned in bpffs.
If cmdline creation of it via bpftool was available that would have been
an option to consider. Not saying that it would have been a better option.
Just another option.

> 
> > Therefore I'd like to request to make 'name' argument to be mandatory.
> 
> Will do in v2!

thx!
 
> > I think in the future we will require BTF to be mandatory too.
> > We need to move towards more transparent and debuggable infra.
> > Do you think requiring a JSON description of key/value would be
> > manageable to implement?
> > Then bpftool could convert it to BTF and the map would be fully defined.
> > I certainly understand that a bpf prog can disregard the key/value
> > layout today, but we will make the verifier enforce that in the future
> > too.
> 
> I was hoping that we can leave BTF support as a future extension, and
> then once we have the option for the verifier to enforce BTF (a sysctl?)
> the bpftool map create without a BTF will get rejected as one would
> expect.  

right. something like sysctl in the future.

> IOW it's fine not to make BTF required at bpftool level and
> leave it to system configuration.
> 
> I'd love to implement the BTF support right away, but I'm not sure I
> can afford that right now time-wise.  The whole map create command is
> pretty trivial, but for BTF we don't even have a way of dumping it
> AFAICT.  We can pretty print values, but what is the format in which to
> express the BTF itself?  We could do JSON, do we use an external
> library?  Should we have a separate BTF command for that?

I prefer standard C type description for both input and output :)
Anyway that wasn't a request for you to do it now. More of the feature
request for somebody to put on todo list :)



Re: [PATCH net-next 11/18] vxlan: Add netif_is_vxlan()

2018-10-15 Thread Ido Schimmel
On Mon, Oct 15, 2018 at 11:57:56AM -0700, Jakub Kicinski wrote:
> On Sat, 13 Oct 2018 17:18:38 +, Ido Schimmel wrote:
> > Add the ability to determine whether a netdev is a VxLAN netdev by
> > calling the above mentioned function that checks the netdev's private
> > flags.
> > 
> > This will allow modules to identify netdev events involving a VxLAN
> > netdev and act accordingly. For example, drivers capable of VxLAN
> > offload will need to configure the underlying device when a VxLAN netdev
> > is being enslaved to an offloaded bridge.
> > 
> > Signed-off-by: Ido Schimmel 
> > Reviewed-by: Petr Machata 
> 
> Is this preferable over
> 
> !strcmp(netdev->rtnl_link_ops->kind, "vxlan")
> 
> which is what TC offloads do?

Using a flag seemed like the more standard way.

That being said, we considered using net_device_ops instead, given we
are about to run out of available private flags, so I don't mind
adopting a technique already employed by another driver.

P.S. Had to Cc netdev again. I think your client somehow messed the Cc
list? I see Cc list in your reply, but with back slashes at the end of
two email addresses.


Re: [PATCH bpf-next v2 0/8] sockmap integration for ktls

2018-10-15 Thread Alexei Starovoitov
On Sat, Oct 13, 2018 at 02:45:55AM +0200, Daniel Borkmann wrote:
> This work adds a generic sk_msg layer and converts both sockmap
> and later ktls over to make use of it as a common data structure
> for application data (similarly as sk_buff for network packets).
> With that in place the sk_msg framework spans accross ULP layer
> in the kernel and allows for introspection or filtering of L7
> data with the help of BPF programs operating on a common input
> context.
> 
> In a second step, we enable the latter for ktls which was previously
> not possible, meaning, ktls and sk_msg verdict programs were
> mutually exclusive in the ULP layer which created challenges for
> the orchestrator when trying to apply TCP based policy, for
> example. Leveraging the prior consolidation we can finally overcome
> this limitation.
> 
> Note, there's no change in behavior when ktls is not used in
> combination with BPF, and also no change in behavior for stand
> alone sockmap. The kselftest suites for ktls, sockmap and ktls
> with sockmap combined also runs through successfully. For further
> details please see individual patches.
> 
> Thanks!
> 
> v1 -> v2:
>   - Removed leftover comment spotted by Alexei
>   - Improved commit messages, rebase

Applied, Thanks



[PATCH net-next] net: phy: merge phy_start_aneg and phy_start_aneg_priv

2018-10-15 Thread Heiner Kallweit
After commit 9f2959b6b52d ("net: phy: improve handling delayed work")
the sync parameter isn't needed any longer in phy_start_aneg_priv().
This allows to merge phy_start_aneg() and phy_start_aneg_priv().

Signed-off-by: Heiner Kallweit 
---
 drivers/net/phy/phy.c | 21 +++--
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index d03bdbbd1..1d73ac330 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -482,16 +482,15 @@ static int phy_config_aneg(struct phy_device *phydev)
 }
 
 /**
- * phy_start_aneg_priv - start auto-negotiation for this PHY device
+ * phy_start_aneg - start auto-negotiation for this PHY device
  * @phydev: the phy_device struct
- * @sync: indicate whether we should wait for the workqueue cancelation
  *
  * Description: Sanitizes the settings (if we're not autonegotiating
  *   them), and then calls the driver's config_aneg function.
  *   If the PHYCONTROL Layer is operating, we change the state to
  *   reflect the beginning of Auto-negotiation or forcing.
  */
-static int phy_start_aneg_priv(struct phy_device *phydev, bool sync)
+int phy_start_aneg(struct phy_device *phydev)
 {
bool trigger = 0;
int err;
@@ -541,20 +540,6 @@ static int phy_start_aneg_priv(struct phy_device *phydev, bool sync)
 
return err;
 }
-
-/**
- * phy_start_aneg - start auto-negotiation for this PHY device
- * @phydev: the phy_device struct
- *
- * Description: Sanitizes the settings (if we're not autonegotiating
- *   them), and then calls the driver's config_aneg function.
- *   If the PHYCONTROL Layer is operating, we change the state to
- *   reflect the beginning of Auto-negotiation or forcing.
- */
-int phy_start_aneg(struct phy_device *phydev)
-{
-   return phy_start_aneg_priv(phydev, true);
-}
 EXPORT_SYMBOL(phy_start_aneg);
 
 static int phy_poll_aneg_done(struct phy_device *phydev)
@@ -1085,7 +1070,7 @@ void phy_state_machine(struct work_struct *work)
mutex_unlock(&phydev->lock);
 
if (needs_aneg)
-   err = phy_start_aneg_priv(phydev, false);
+   err = phy_start_aneg(phydev);
else if (do_suspend)
phy_suspend(phydev);
 
-- 
2.19.1



Re: [PATCH net] net/sched: properly init chain in case of multiple control actions

2018-10-15 Thread Cong Wang
On Sat, Oct 13, 2018 at 8:23 AM Davide Caratti  wrote:
>
> On Fri, 2018-10-12 at 13:57 -0700, Cong Wang wrote:
> > Why not just validate the fallback action in each action init()?
> > For example, checking tcfg_paction in tcf_gact_init().
> >
> > I don't see the need of making it generic.
>
> hello Cong, once again thanks for looking at this.
>
> what you say is doable, and I evaluated doing it before proposing this
> patch.
>
> But I felt uncomfortable, because I needed to pass struct tcf_proto *tp in
> tcf_gact_init() to initialize a->goto_chain with the chain_idx encoded in
> the fallback action. So, I would have changed all the init() functions in
> all TC actions, just to fix two of them.
>
> A (legal?) trick is to let tcf_action store the fallback action when it
> contains a 'goto chain' command, I just posted a proposal for gact. If you
> think it's ok, I will test and post the same for act_police.

Do we really need to support TC_ACT_GOTO_CHAIN for
gact->tcfg_paction etc.? I mean, is it useful in practice or is it just for
completeness?

IF we don't need to support it, we can just make it invalid without needing
to initialize it in ->init() at all.

If we do, however, we really need to move it into each ->init(), because
we have to lock each action if we are modifying an existing one. With
your patch, tcf_action_goto_chain_init() is still called without the per-action
lock.

What's more, if we support two different actions in gact, that is, tcfg_paction
and tcf_action, how could you still only have one a->goto_chain pointer?
There should be two pointers for each of them. :)

Thanks.


Re: [bpf-next PATCH v3 2/2] bpf: bpftool, add flag to allow non-compat map definitions

2018-10-15 Thread Jakub Kicinski
On Mon, 15 Oct 2018 11:19:55 -0700, John Fastabend wrote:
> Multiple map definition structures exist and users may have non-zero
> fields in their definitions that are not recognized by bpftool and
> libbpf. The normal behavior is then to fail loading the map. Although
> this is a good default behavior, users may still want to load the map
> for debugging or other reasons. This patch adds a --mapcompat flag
> that can be used to override the default behavior and allow loading
> the map even when it has additional non-zero fields.
> 
> For now the only user is 'bpftool prog'; we can switch over other
> subcommands as needed. The library exposes an API that consumes
> a flags field now, but I kept the original API around in case
> users of the API don't want to expose this. The flags field is an
> int in case we need more control over how the API call handles
> errors/features/etc in the future.
> 
> Signed-off-by: John Fastabend 

Acked-by: Jakub Kicinski 

Thank you!


[bpf-next PATCH v3 2/2] bpf: bpftool, add flag to allow non-compat map definitions

2018-10-15 Thread John Fastabend
Multiple map definition structures exist and users may have non-zero
fields in their definitions that are not recognized by bpftool and
libbpf. The normal behavior is then to fail loading the map. Although
this is a good default behavior, users may still want to load the map
for debugging or other reasons. This patch adds a --mapcompat flag
that can be used to override the default behavior and allow loading
the map even when it has additional non-zero fields.

For now the only user is 'bpftool prog'; we can switch over other
subcommands as needed. The library exposes an API that consumes
a flags field now, but I kept the original API around in case
users of the API don't want to expose this. The flags field is an
int in case we need more control over how the API call handles
errors/features/etc in the future.

Signed-off-by: John Fastabend 
---
 tools/bpf/bpftool/Documentation/bpftool.rst |4 
 tools/bpf/bpftool/bash-completion/bpftool   |2 +-
 tools/bpf/bpftool/main.c|7 ++-
 tools/bpf/bpftool/main.h|3 ++-
 tools/bpf/bpftool/prog.c|2 +-
 5 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst b/tools/bpf/bpftool/Documentation/bpftool.rst
index 25c0872..6548831 100644
--- a/tools/bpf/bpftool/Documentation/bpftool.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool.rst
@@ -57,6 +57,10 @@ OPTIONS
-p, --pretty
  Generate human-readable JSON output. Implies **-j**.
 
+   -m, --mapcompat
+ Allow loading maps with unknown map definitions.
+
+
 SEE ALSO
 
**bpftool-map**\ (8), **bpftool-prog**\ (8), **bpftool-cgroup**\ (8)
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index 0826519..ac85207 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -184,7 +184,7 @@ _bpftool()
 
 # Deal with options
 if [[ ${words[cword]} == -* ]]; then
-local c='--version --json --pretty --bpffs'
+local c='--version --json --pretty --bpffs --mapcompat'
 COMPREPLY=( $( compgen -W "$c" -- "$cur" ) )
 return 0
 fi
diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
index 79dc3f1..828dde3 100644
--- a/tools/bpf/bpftool/main.c
+++ b/tools/bpf/bpftool/main.c
@@ -55,6 +55,7 @@
 bool pretty_output;
 bool json_output;
 bool show_pinned;
+int bpf_flags;
 struct pinned_obj_table prog_table;
 struct pinned_obj_table map_table;
 
@@ -341,6 +342,7 @@ int main(int argc, char **argv)
{ "pretty", no_argument,NULL,   'p' },
{ "version",no_argument,NULL,   'V' },
{ "bpffs",  no_argument,NULL,   'f' },
+   { "mapcompat",  no_argument,NULL,   'm' },
{ 0 }
};
int opt, ret;
@@ -355,7 +357,7 @@ int main(int argc, char **argv)
hash_init(map_table.table);
 
opterr = 0;
-   while ((opt = getopt_long(argc, argv, "Vhpjf",
+   while ((opt = getopt_long(argc, argv, "Vhpjfm",
  options, NULL)) >= 0) {
switch (opt) {
case 'V':
@@ -379,6 +381,9 @@ int main(int argc, char **argv)
case 'f':
show_pinned = true;
break;
+   case 'm':
+   bpf_flags = MAPS_RELAX_COMPAT;
+   break;
default:
p_err("unrecognized option '%s'", argv[optind - 1]);
if (json_output)
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 40492cd..91fd697 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -74,7 +74,7 @@
 #define HELP_SPEC_PROGRAM  \
"PROG := { id PROG_ID | pinned FILE | tag PROG_TAG }"
 #define HELP_SPEC_OPTIONS  \
-   "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} }"
+   "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} | {-m|--mapcompat}"
 #define HELP_SPEC_MAP  \
"MAP := { id MAP_ID | pinned FILE }"
 
@@ -89,6 +89,7 @@ enum bpf_obj_type {
 extern json_writer_t *json_wtr;
 extern bool json_output;
 extern bool show_pinned;
+extern int bpf_flags;
 extern struct pinned_obj_table prog_table;
 extern struct pinned_obj_table map_table;
 
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 99ab42c..3350289 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -908,7 +908,7 @@ static int do_load(int argc, char **argv)
}
}
 
-   obj = bpf_object__open_xattr(&attr);
+   obj = __bpf_object__open_xattr(&attr, bpf_flags);
if (IS_ERR_OR_NULL(obj)) {
p_err("failed to 

[bpf-next PATCH v3 1/2] bpf: bpftool, add support for attaching programs to maps

2018-10-15 Thread John Fastabend
Sock map/hash introduce support for attaching programs to maps. To
date I have been doing this with custom tooling, but that is less than
ideal as we shift to using bpftool as the single CLI for our BPF uses.
This patch adds new sub commands 'attach' and 'detach' to the 'prog'
command to attach programs to maps and then detach them.

Signed-off-by: John Fastabend 
Reviewed-by: Jakub Kicinski 
---
 tools/bpf/bpftool/Documentation/bpftool-prog.rst |   11 ++
 tools/bpf/bpftool/Documentation/bpftool.rst  |2 
 tools/bpf/bpftool/bash-completion/bpftool|   19 
 tools/bpf/bpftool/prog.c |   99 ++
 4 files changed, 128 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
index 64156a1..12c8030 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
@@ -25,6 +25,8 @@ MAP COMMANDS
|  **bpftool** **prog dump jited**  *PROG* [{**file** *FILE* | **opcodes**}]
|  **bpftool** **prog pin** *PROG* *FILE*
|  **bpftool** **prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
+|   **bpftool** **prog attach** *PROG* *ATTACH_TYPE* *MAP*
+|   **bpftool** **prog detach** *PROG* *ATTACH_TYPE* *MAP*
 |  **bpftool** **prog help**
 |
 |  *MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
@@ -37,6 +39,7 @@ MAP COMMANDS
|  **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** |
|  **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6**
 |  }
+|   *ATTACH_TYPE* := { **msg_verdict** | **skb_verdict** | **skb_parse** }
 
 
 DESCRIPTION
@@ -90,6 +93,14 @@ DESCRIPTION
 
  Note: *FILE* must be located in *bpffs* mount.
 
+**bpftool prog attach** *PROG* *ATTACH_TYPE* *MAP*
+  Attach bpf program *PROG* (with type specified by *ATTACH_TYPE*)
+  to the map *MAP*.
+
+**bpftool prog detach** *PROG* *ATTACH_TYPE* *MAP*
+  Detach bpf program *PROG* (with type specified by *ATTACH_TYPE*)
+  from the map *MAP*.
+
**bpftool prog help**
  Print short help message.
 
diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst b/tools/bpf/bpftool/Documentation/bpftool.rst
index 8dda77d..25c0872 100644
--- a/tools/bpf/bpftool/Documentation/bpftool.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool.rst
@@ -26,7 +26,7 @@ SYNOPSIS
| **pin** | **event_pipe** | **help** }
 
*PROG-COMMANDS* := { **show** | **list** | **dump jited** | **dump xlated** | **pin**
-   | **load** | **help** }
+   | **load** | **attach** | **detach** | **help** }
 
*CGROUP-COMMANDS* := { **show** | **list** | **attach** | **detach** | **help** }
 
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index df1060b..0826519 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -292,6 +292,23 @@ _bpftool()
 fi
 return 0
 ;;
+attach|detach)
+if [[ ${#words[@]} == 7 ]]; then
+COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
+return 0
+fi
+
+if [[ ${#words[@]} == 6 ]]; then
+COMPREPLY=( $( compgen -W "msg_verdict skb_verdict skb_parse" -- "$cur" ) )
+return 0
+fi
+
+if [[ $prev == "$command" ]]; then
+COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
+return 0
+fi
+return 0
+;;
 load)
 local obj
 
@@ -347,7 +364,7 @@ _bpftool()
 ;;
 *)
 [[ $prev == $object ]] && \
-COMPREPLY=( $( compgen -W 'dump help pin load \
+COMPREPLY=( $( compgen -W 'dump help pin attach detach load \
show list' -- "$cur" ) )
 ;;
 esac
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index b1cd3bc..99ab42c 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -77,6 +77,26 @@
[BPF_PROG_TYPE_FLOW_DISSECTOR]  = "flow_dissector",
 };
 
+static const char * const attach_type_strings[] = {
+   [BPF_SK_SKB_STREAM_PARSER] = "stream_parser",
+   [BPF_SK_SKB_STREAM_VERDICT] = "stream_verdict",
+   [BPF_SK_MSG_VERDICT] = "msg_verdict",
+   [__MAX_BPF_ATTACH_TYPE] = NULL,
+};
+
+enum bpf_attach_type parse_attach_type(const char *str)
+{
+   enum 

[bpf-next PATCH v3 0/2] bpftool support for sockmap use cases

2018-10-15 Thread John Fastabend
The first patch adds support for attaching programs to maps. This is
needed to support sock{map|hash} use from bpftool. Currently I carry
around custom code to do this, so doing it with standard bpftool will
be great.

The second patch adds a compat mode to ignore non-zero entries in
the map def. This allows using bpftool with maps that have extra
fields that the user knows can be ignored. This is needed to work
correctly with maps being loaded by other tools or directly via
syscalls.

v3: add bash completion and doc updates for --mapcompat

---

John Fastabend (2):
  bpf: bpftool, add support for attaching programs to maps
  bpf: bpftool, add flag to allow non-compat map definitions


 tools/bpf/bpftool/Documentation/bpftool-prog.rst |   11 ++
 tools/bpf/bpftool/Documentation/bpftool.rst  |6 +
 tools/bpf/bpftool/bash-completion/bpftool|   21 -
 tools/bpf/bpftool/main.c |7 +-
 tools/bpf/bpftool/main.h |3 -
 tools/bpf/bpftool/prog.c |  101 ++
 6 files changed, 142 insertions(+), 7 deletions(-)

--
Signature


Re: [PATCH stable 4.9 v2 00/29] backport of IP fragmentation fixes

2018-10-15 Thread Eric Dumazet
On Mon, Oct 15, 2018 at 10:47 AM Florian Fainelli  wrote:
>
>
>
> On 10/10/2018 12:29 PM, Florian Fainelli wrote:
> > This is based on Stephen's v4.14 patches, with the necessary merge
> > conflicts, and the lack of timer_setup() on the 4.9 baseline.
> >
> > Perf results on a gigabit capable system, before and after are below.
> >
> > Series can also be found here:
> >
> > https://github.com/ffainelli/linux/commits/fragment-stack-v4.9-v2
> >
> > Changes in v2:
> >
> > - drop "net: sk_buff rbnode reorg"
> > - added original "ip: use rb trees for IP frag queue." commit
>
> Eric, does this look reasonable to you?

Yes, thanks a lot Florian.

>
> >
> > Before patches:
> >
> >PerfTop: 180 irqs/sec  kernel:78.9%  exact:  0.0% [4000Hz cycles:ppp],  (all, 4 CPUs)
> > ---
> >
> > 34.81%  [kernel]   [k] ip_defrag
> >  4.57%  [kernel]   [k] arch_cpu_idle
> >  2.09%  [kernel]   [k] fib_table_lookup
> >  1.74%  [kernel]   [k] finish_task_switch
> >  1.57%  [kernel]   [k] v7_dma_inv_range
> >  1.47%  [kernel]   [k] __netif_receive_skb_core
> >  1.06%  [kernel]   [k] __slab_free
> >  1.04%  [kernel]   [k] __netdev_alloc_skb
> >  0.99%  [kernel]   [k] ip_route_input_noref
> >  0.96%  [kernel]   [k] dev_gro_receive
> >  0.96%  [kernel]   [k] tick_nohz_idle_enter
> >  0.93%  [kernel]   [k] bcm_sysport_poll
> >  0.92%  [kernel]   [k] skb_release_data
> >  0.91%  [kernel]   [k] __memzero
> >  0.90%  [kernel]   [k] __free_page_frag
> >  0.87%  [kernel]   [k] ip_rcv
> >  0.77%  [kernel]   [k] eth_type_trans
> >  0.71%  [kernel]   [k] _raw_spin_unlock_irqrestore
> >  0.68%  [kernel]   [k] tick_nohz_idle_exit
> >  0.65%  [kernel]   [k] bcm_sysport_rx_refill
> >
> > After patches:
> >
> >PerfTop: 214 irqs/sec  kernel:80.4%  exact:  0.0% [4000Hz cycles:ppp],  (all, 4 CPUs)
> > ---
> >
> >  6.61%  [kernel]   [k] arch_cpu_idle
> >  3.77%  [kernel]   [k] ip_defrag
> >  3.65%  [kernel]   [k] v7_dma_inv_range
> >  3.18%  [kernel]   [k] fib_table_lookup
> >  3.04%  [kernel]   [k] __netif_receive_skb_core
> >  2.31%  [kernel]   [k] finish_task_switch
> >  2.31%  [kernel]   [k] _raw_spin_unlock_irqrestore
> >  1.65%  [kernel]   [k] bcm_sysport_poll
> >  1.63%  [kernel]   [k] ip_route_input_noref
> >  1.63%  [kernel]   [k] __memzero
> >  1.58%  [kernel]   [k] __netdev_alloc_skb
> >  1.47%  [kernel]   [k] tick_nohz_idle_enter
> >  1.40%  [kernel]   [k] __slab_free
> >  1.32%  [kernel]   [k] ip_rcv
> >  1.32%  [kernel]   [k] __softirqentry_text_start
> >  1.30%  [kernel]   [k] dev_gro_receive
> >  1.23%  [kernel]   [k] bcm_sysport_rx_refill
> >  1.11%  [kernel]   [k] tick_nohz_idle_exit
> >  1.06%  [kernel]   [k] memcmp
> >  1.02%  [kernel]   [k] dma_cache_maint_page
> >
> >
> > Dan Carpenter (1):
> >   ipv4: frags: precedence bug in ip_expire()
> >
> > Eric Dumazet (21):
> >   inet: frags: change inet_frags_init_net() return value
> >   inet: frags: add a pointer to struct netns_frags
> >   inet: frags: refactor ipfrag_init()
> >   inet: frags: refactor ipv6_frag_init()
> >   inet: frags: refactor lowpan_net_frag_init()
> >   ipv6: export ip6 fragments sysctl to unprivileged users
> >   rhashtable: add schedule points
> >   inet: frags: use rhashtables for reassembly units
> >   inet: frags: remove some helpers
> >   inet: frags: get rid of inet_frag_evicting()
> >   inet: frags: remove inet_frag_maybe_warn_overflow()
> >   inet: frags: break the 2GB limit for frags storage
> >   inet: frags: do not clone skb in ip_expire()
> >   ipv6: frags: rewrite ip6_expire_frag_queue()
> >   rhashtable: reorganize struct rhashtable layout
> >   inet: frags: reorganize struct netns_frags
> >   inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
> >   inet: frags: fix ip6frag_low_thresh boundary
> >   net: speed up skb_rbtree_purge()
> >   net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
> >   net: add rb_to_skb() and other rb tree helpers
> >
> > Florian Westphal (1):
> >   ipv6: defrag: drop non-last frags smaller than min mtu
> >
> > Peter Oskolkov (5):
> >   ip: discard IPv4 datagrams with overlapping segments.
> >   net: modify skb_rbtree_purge to return the truesize of all purged
> > skbs.
> >   ip: use rb trees for IP frag queue.
> >   ip: add helpers to process in-order fragments faster.
> >   ip: process in-order fragments efficiently
> >
> > Taehee Yoo (1):
> >   ip: frags: fix crash in ip_do_fragment()
> >
> >  Documentation/networking/ip-sysctl.txt  |  13 +-
> >  include/linux/rhashtable.h  |   4 +-
> >  include/linux/skbuff.h   

Re: [PATCH stable 4.9 v2 00/29] backport of IP fragmentation fixes

2018-10-15 Thread Florian Fainelli



On 10/10/2018 12:29 PM, Florian Fainelli wrote:
> This is based on Stephen's v4.14 patches, with the necessary merge
> conflicts, and the lack of timer_setup() on the 4.9 baseline.
> 
> Perf results on a gigabit capable system, before and after are below.
> 
> Series can also be found here:
> 
> https://github.com/ffainelli/linux/commits/fragment-stack-v4.9-v2
> 
> Changes in v2:
> 
> - drop "net: sk_buff rbnode reorg"
> - added original "ip: use rb trees for IP frag queue." commit

Eric, does this look reasonable to you?

> 
> Before patches:
> 
>PerfTop: 180 irqs/sec  kernel:78.9%  exact:  0.0% [4000Hz cycles:ppp],  (all, 4 CPUs)
> ---
> 
> 34.81%  [kernel]   [k] ip_defrag
>  4.57%  [kernel]   [k] arch_cpu_idle
>  2.09%  [kernel]   [k] fib_table_lookup
>  1.74%  [kernel]   [k] finish_task_switch
>  1.57%  [kernel]   [k] v7_dma_inv_range
>  1.47%  [kernel]   [k] __netif_receive_skb_core
>  1.06%  [kernel]   [k] __slab_free
>  1.04%  [kernel]   [k] __netdev_alloc_skb
>  0.99%  [kernel]   [k] ip_route_input_noref
>  0.96%  [kernel]   [k] dev_gro_receive
>  0.96%  [kernel]   [k] tick_nohz_idle_enter
>  0.93%  [kernel]   [k] bcm_sysport_poll
>  0.92%  [kernel]   [k] skb_release_data
>  0.91%  [kernel]   [k] __memzero
>  0.90%  [kernel]   [k] __free_page_frag
>  0.87%  [kernel]   [k] ip_rcv
>  0.77%  [kernel]   [k] eth_type_trans
>  0.71%  [kernel]   [k] _raw_spin_unlock_irqrestore
>  0.68%  [kernel]   [k] tick_nohz_idle_exit
>  0.65%  [kernel]   [k] bcm_sysport_rx_refill
> 
> After patches:
> 
>PerfTop: 214 irqs/sec  kernel:80.4%  exact:  0.0% [4000Hz cycles:ppp],  (all, 4 CPUs)
> ---
> 
>  6.61%  [kernel]   [k] arch_cpu_idle
>  3.77%  [kernel]   [k] ip_defrag
>  3.65%  [kernel]   [k] v7_dma_inv_range
>  3.18%  [kernel]   [k] fib_table_lookup
>  3.04%  [kernel]   [k] __netif_receive_skb_core
>  2.31%  [kernel]   [k] finish_task_switch
>  2.31%  [kernel]   [k] _raw_spin_unlock_irqrestore
>  1.65%  [kernel]   [k] bcm_sysport_poll
>  1.63%  [kernel]   [k] ip_route_input_noref
>  1.63%  [kernel]   [k] __memzero
>  1.58%  [kernel]   [k] __netdev_alloc_skb
>  1.47%  [kernel]   [k] tick_nohz_idle_enter
>  1.40%  [kernel]   [k] __slab_free
>  1.32%  [kernel]   [k] ip_rcv
>  1.32%  [kernel]   [k] __softirqentry_text_start
>  1.30%  [kernel]   [k] dev_gro_receive
>  1.23%  [kernel]   [k] bcm_sysport_rx_refill
>  1.11%  [kernel]   [k] tick_nohz_idle_exit
>  1.06%  [kernel]   [k] memcmp
>  1.02%  [kernel]   [k] dma_cache_maint_page
> 
> 
> Dan Carpenter (1):
>   ipv4: frags: precedence bug in ip_expire()
> 
> Eric Dumazet (21):
>   inet: frags: change inet_frags_init_net() return value
>   inet: frags: add a pointer to struct netns_frags
>   inet: frags: refactor ipfrag_init()
>   inet: frags: refactor ipv6_frag_init()
>   inet: frags: refactor lowpan_net_frag_init()
>   ipv6: export ip6 fragments sysctl to unprivileged users
>   rhashtable: add schedule points
>   inet: frags: use rhashtables for reassembly units
>   inet: frags: remove some helpers
>   inet: frags: get rid of inet_frag_evicting()
>   inet: frags: remove inet_frag_maybe_warn_overflow()
>   inet: frags: break the 2GB limit for frags storage
>   inet: frags: do not clone skb in ip_expire()
>   ipv6: frags: rewrite ip6_expire_frag_queue()
>   rhashtable: reorganize struct rhashtable layout
>   inet: frags: reorganize struct netns_frags
>   inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
>   inet: frags: fix ip6frag_low_thresh boundary
>   net: speed up skb_rbtree_purge()
>   net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
>   net: add rb_to_skb() and other rb tree helpers
> 
> Florian Westphal (1):
>   ipv6: defrag: drop non-last frags smaller than min mtu
> 
> Peter Oskolkov (5):
>   ip: discard IPv4 datagrams with overlapping segments.
>   net: modify skb_rbtree_purge to return the truesize of all purged
> skbs.
>   ip: use rb trees for IP frag queue.
>   ip: add helpers to process in-order fragments faster.
>   ip: process in-order fragments efficiently
> 
> Taehee Yoo (1):
>   ip: frags: fix crash in ip_do_fragment()
> 
>  Documentation/networking/ip-sysctl.txt  |  13 +-
>  include/linux/rhashtable.h  |   4 +-
>  include/linux/skbuff.h  |  34 +-
>  include/net/inet_frag.h | 133 +++---
>  include/net/ip.h|   1 -
>  include/net/ipv6.h  |  26 +-
>  include/uapi/linux/snmp.h   |   1 +
>  lib/rhashtable.c|   5 +-
>  net/core/skbuff.c

[PATCH bpf-next 1/2] bpf: Allow sk_lookup with IPv6 module

2018-10-15 Thread Joe Stringer
This is a more complete fix than d71019b54bff ("net: core: Fix build
with CONFIG_IPV6=m"), so that IPv6 sockets may be looked up if the IPv6
module is loaded (not just if it's compiled in).

Signed-off-by: Joe Stringer 
---
 include/net/addrconf.h |  5 +
 net/core/filter.c  | 12 +++-
 net/ipv6/af_inet6.c|  1 +
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 6def0351bcc3..14b789a123e7 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -265,6 +265,11 @@ extern const struct ipv6_stub *ipv6_stub __read_mostly;
 struct ipv6_bpf_stub {
int (*inet6_bind)(struct sock *sk, struct sockaddr *uaddr, int addr_len,
  bool force_bind_address_no_port, bool with_lock);
+   struct sock *(*udp6_lib_lookup)(struct net *net,
+					const struct in6_addr *saddr, __be16 sport,
+					const struct in6_addr *daddr, __be16 dport,
+					int dif, int sdif, struct udp_table *tbl,
+					struct sk_buff *skb);
 };
 extern const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index b844761b5d4c..21aba2a521c7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4842,7 +4842,7 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 	sk = __udp4_lib_lookup(net, src4, tuple->ipv4.sport,
 			       dst4, tuple->ipv4.dport,
 			       dif, sdif, &udp_table, skb);
-#if IS_REACHABLE(CONFIG_IPV6)
+#if IS_ENABLED(CONFIG_IPV6)
 	} else {
 		struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr;
 		struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr;
@@ -4853,10 +4853,12 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 			src6, tuple->ipv6.sport,
 			dst6, tuple->ipv6.dport,
 			dif, sdif, &refcounted);
-	else
-		sk = __udp6_lib_lookup(net, src6, tuple->ipv6.sport,
-				       dst6, tuple->ipv6.dport,
-				       dif, sdif, &udp_table, skb);
+	else if (likely(ipv6_bpf_stub))
+		sk = ipv6_bpf_stub->udp6_lib_lookup(net,
+						    src6, tuple->ipv6.sport,
+						    dst6, tuple->ipv6.dport,
+						    dif, sdif,
+						    &udp_table, skb);
 #endif
}
 
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index e9c8cfdf4b4c..3f4d61017a69 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -901,6 +901,7 @@ static const struct ipv6_stub ipv6_stub_impl = {
 
 static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = {
.inet6_bind = __inet6_bind,
+   .udp6_lib_lookup = __udp6_lib_lookup,
 };
 
 static int __init inet6_init(void)
-- 
2.17.1



[PATCH bpf-next 0/2] IPv6 sk-lookup fixes

2018-10-15 Thread Joe Stringer
This series includes a couple of fixups for the IPv6 socket lookup
helper, to make the API more consistent (always supply all arguments in
network byte-order) and to allow its use when IPv6 is compiled as a
module.

Joe Stringer (2):
  bpf: Allow sk_lookup with IPv6 module
  bpf: Fix IPv6 dport byte-order in bpf_sk_lookup

 include/net/addrconf.h |  5 +
 net/core/filter.c  | 15 +--
 net/ipv6/af_inet6.c|  1 +
 3 files changed, 15 insertions(+), 6 deletions(-)

-- 
2.17.1



[PATCH bpf-next 2/2] bpf: Fix IPv6 dport byte-order in bpf_sk_lookup

2018-10-15 Thread Joe Stringer
Commit 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
mistakenly passed the destination port in network byte order to the IPv6
TCP/UDP socket lookup functions, which meant that BPF writers had to
manually swap the byte order of this field; otherwise IPv6 sockets
could not be located via this helper.

Fix the issue by swapping the byte-order appropriately in the helper.
This also makes the API more consistent with the IPv4 version.

Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
Signed-off-by: Joe Stringer 
---
 net/core/filter.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 21aba2a521c7..d877c4c599ce 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4846,17 +4846,18 @@ static struct sock *sk_lookup(struct net *net, struct 
bpf_sock_tuple *tuple,
} else {
struct in6_addr *src6 = (struct in6_addr *)>ipv6.saddr;
struct in6_addr *dst6 = (struct in6_addr *)>ipv6.daddr;
+   u16 hnum = ntohs(tuple->ipv6.dport);
int sdif = inet6_sdif(skb);
 
if (proto == IPPROTO_TCP)
sk = __inet6_lookup(net, _hashinfo, skb, 0,
src6, tuple->ipv6.sport,
-   dst6, tuple->ipv6.dport,
+   dst6, hnum,
dif, sdif, );
else if (likely(ipv6_bpf_stub))
sk = ipv6_bpf_stub->udp6_lib_lookup(net,
src6, 
tuple->ipv6.sport,
-   dst6, 
tuple->ipv6.dport,
+   dst6, hnum,
dif, sdif,
_table, skb);
 #endif
-- 
2.17.1



Re: [PATCH bpf-next] tools: bpftool: add map create command

2018-10-15 Thread Jakub Kicinski
On Fri, 12 Oct 2018 23:16:59 -0700, Alexei Starovoitov wrote:
> On Fri, Oct 12, 2018 at 11:06:14AM -0700, Jakub Kicinski wrote:
> > Add a way of creating maps from user space.  The command takes
> > as parameters most of the attributes of the map creation system
> > call command.  After the map is created it's pinned to bpffs.  This makes
> > it possible to easily and dynamically (without rebuilding programs)
> > test various corner cases related to map creation.
> > 
> > Map type names are taken from bpftool's array used for printing.
> > In general these days we try to make use of libbpf type names, but
> > there are no map type names in libbpf as of today.
> > 
> > As with most features I add the motivation is testing (offloads) :)
> > 
> > Signed-off-by: Jakub Kicinski 
> > Reviewed-by: Quentin Monnet   
> ...
> > fprintf(stderr,
> > "Usage: %s %s { show | list }   [MAP]\n"
> > +   "   %s %s create FILE type TYPE key KEY_SIZE value VALUE_SIZE \\\n"
> > +   "  entries MAX_ENTRIES [name NAME] [flags FLAGS] \\\n"
> > +   "  [dev NAME]\n"
> 
> I suspect as soon as bpftool has the ability to create standalone maps
> some folks will start relying on such an interface.

That'd be cool. Do you see any real-life use cases where it's useful
outside of corner-case testing?

> Therefore I'd like to request to make 'name' argument to be mandatory.

Will do in v2!

> I think in the future we will require BTF to be mandatory too.
> We need to move towards more transparent and debuggable infra.
> Do you think requiring a json description of key/value would be manageable to
> implement?
> Then bpftool could convert it to BTF and the map full be fully defined.
> I certainly understand that bpf prog can disregard the key/value layout today,
> but we will make verifier to enforce that in the future too.

I was hoping that we can leave BTF support as a future extension, and
then once we have the option for the verifier to enforce BTF (a sysctl?)
the bpftool map create without a BTF will get rejected as one would
expect.  IOW it's fine not to make BTF required at bpftool level and
leave it to system configuration.

I'd love to implement the BTF support right away, but I'm not sure I
can afford that right now time-wise.  The whole map create command is
pretty trivial, but for BTF we don't even have a way of dumping it
AFAICT.  We can pretty print values, but what is the format in which to
express the BTF itself?  We could do JSON, do we use an external
library?  Should we have a separate BTF command for that?


Re: [PATCH iproute 2/2] utils: fix get_rtnl_link_stats_rta stats parsing

2018-10-15 Thread Stephen Hemminger
On Thu, 11 Oct 2018 14:24:03 +0200
Lorenzo Bianconi  wrote:

> > > iproute2 walks through the list of available tunnels using the netlink
> > > protocol in order to get device info, instead of reading
> > > it from the proc filesystem. However the kernel reports device statistics
> > > using IFLA_INET6_STATS/IFLA_INET6_ICMP6STATS attributes nested in
> > > IFLA_PROTINFO one, but iproute2 expects this info in
> > > IFLA_STATS64/IFLA_STATS attributes.
> > > The issue can be triggered with the following reproducer:
> > > 
> > > $ip link add ip6d0 type ip6tnl mode ip6ip6 local ::1 remote ::1
> > > $ip -6 -d -s tunnel show ip6d0
> > > ip6d0: ipv6/ipv6 remote ::1 local ::1 encaplimit 4 hoplimit 64
> > > tclass 0x00 flowlabel 0x0 (flowinfo 0x)
> > > Dump terminated
> > > 
> > > Fix the issue by introducing IFLA_INET6_STATS attribute parsing.
> > > 
> > > Fixes: 3e953938717f ("iptunnel/ip6tunnel: Use netlink to walk through
> > > tunnels list")
> > > 
> > > Signed-off-by: Lorenzo Bianconi   
> > 
> > Can't we fix the kernel to report statistics properly, rather than
> > starting iproute2 doing more /proc interfaces.
> >   
> 
> Hi Stephen,
> 
> sorry, I did not get what you mean. The current iproute implementation
> walks through the tunnels list using the netlink protocol and parses device
> statistics from the kernel netlink message. However it does not take
> into account the actual netlink message layout, since the statistics
> attribute is nested in the IFLA_PROTINFO one.
> Moreover AFAIU the related kernel code has not changed since iproute
> commit 3e953938717f, so I guess we should fix the issue in the iproute code
> instead of in the kernel one. Do you agree?
> 
> Regards,
> Lorenzo

Applied to current iproute2.


[PATCH net-next 7/7] tcp: cdg: use tcp high resolution clock cache

2018-10-15 Thread Eric Dumazet
We store in the tcp socket a cache of the most recent high-resolution
clock sample; there is no need to call local_clock() again, since
this cache is good enough.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_cdg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c
index 06fbe102a425f28b43294925d8d13af4a13ec776..37eebd9103961be4731323cfb4d933b51954e802 100644
--- a/net/ipv4/tcp_cdg.c
+++ b/net/ipv4/tcp_cdg.c
@@ -146,7 +146,7 @@ static void tcp_cdg_hystart_update(struct sock *sk)
return;
 
if (hystart_detect & HYSTART_ACK_TRAIN) {
-   u32 now_us = div_u64(local_clock(), NSEC_PER_USEC);
+   u32 now_us = tp->tcp_mstamp;
 
if (ca->last_ack == 0 || !tcp_is_cwnd_limited(sk)) {
ca->last_ack = now_us;
-- 
2.19.0.605.g01d371f741-goog



[PATCH net-next 3/7] tcp: mitigate scheduling jitter in EDT pacing model

2018-10-15 Thread Eric Dumazet
In commit fefa569a9d4b ("net_sched: sch_fq: account for schedule/timers
drifts") we added a mitigation for scheduling jitter in fq packet scheduler.

This patch does the same in the TCP stack, now that it uses the EDT model.

Note that this mitigation is valid for both external (fq packet scheduler)
or internal TCP pacing.

This uses the same strategy as the above commit, allowing
a time credit of half the packet currently sent.

Consider the following case:

An skb is sent after an idle period of 300 usec.
The air-time (skb->len/pacing_rate) is 500 usec.
Instead of setting the pacing timer to now+500 usec,
it will use now+min(500/2, 300) -> now+250 usec.

This is like having a token bucket with a depth of half
an skb.

Tested:

tc qdisc replace dev eth0 root pfifo_fast

Before :
netperf -P0 -H remote -- -q 10   # 8000Mbit
54 262144 262144 10.00 7710.43

After :
netperf -P0 -H remote -- -q 10   # 8000 Mbit
54 262144 262144 10.00 7999.75   # Much closer to 8000Mbit target

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_output.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f4aa4109334a043d02b17b18bef346d805dab501..5474c9854f252e50cdb1136435417873861d7618 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -985,7 +985,8 @@ static void tcp_internal_pacing(struct sock *sk)
sock_hold(sk);
 }
 
-static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb)
+static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb,
+ u64 prior_wstamp)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
@@ -998,7 +999,12 @@ static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb)
 * this is a minor annoyance.
 */
if (rate != ~0UL && rate && tp->data_segs_out >= 10) {
-   tp->tcp_wstamp_ns += div64_ul((u64)skb->len * NSEC_PER_SEC, rate);
+   u64 len_ns = div64_ul((u64)skb->len * NSEC_PER_SEC, rate);
+   u64 credit = tp->tcp_wstamp_ns - prior_wstamp;
+
+   /* take into account OS jitter */
+   len_ns -= min_t(u64, len_ns / 2, credit);
+   tp->tcp_wstamp_ns += len_ns;
 
tcp_internal_pacing(sk);
}
@@ -1029,6 +1035,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
struct sk_buff *oskb = NULL;
struct tcp_md5sig_key *md5;
struct tcphdr *th;
+   u64 prior_wstamp;
int err;
 
BUG_ON(!skb || !tcp_skb_pcount(skb));
@@ -1050,7 +1057,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
return -ENOBUFS;
}
 
-   /* TODO: might take care of jitter here */
+   prior_wstamp = tp->tcp_wstamp_ns;
tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
 
skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
@@ -1169,7 +1176,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
err = net_xmit_eval(err);
}
if (!err && oskb) {
-   tcp_update_skb_after_send(sk, oskb);
+   tcp_update_skb_after_send(sk, oskb, prior_wstamp);
tcp_rate_skb_sent(sk, oskb);
}
return err;
@@ -2321,7 +2328,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
	if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
/* "skb_mstamp" is used as a start point for the 
retransmit timer */
-   tcp_update_skb_after_send(sk, skb);
+   tcp_update_skb_after_send(sk, skb, tp->tcp_wstamp_ns);
goto repair; /* Skip network transmission */
}
 
@@ -2896,7 +2903,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
} tcp_skb_tsorted_restore(skb);
 
if (!err) {
-   tcp_update_skb_after_send(sk, skb);
+   tcp_update_skb_after_send(sk, skb, tp->tcp_wstamp_ns);
tcp_rate_skb_sent(sk, skb);
}
} else {
-- 
2.19.0.605.g01d371f741-goog



[PATCH net-next 2/7] net: extend sk_pacing_rate to unsigned long

2018-10-15 Thread Eric Dumazet
sk_pacing_rate was introduced as a u32 field in 2013,
effectively limiting per-flow pacing to 34Gbit.

We believe it is time to allow TCP to pace high speed flows
on 64bit hosts, as we now can reach 100Gbit on one TCP flow.

This patch adds no cost for 32bit kernels.

The tcpi_pacing_rate and tcpi_max_pacing_rate were already
exported as 64bit, so iproute2/ss command require no changes.

Unfortunately the SO_MAX_PACING_RATE socket option will stay
32bit and we will need to add a new option to let applications
control high pacing rates.

State  Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB  0  1787144  10.246.9.76:49992 10.246.9.77:36741
 timer:(on,003ms,0) ino:91863 sk:2 <->
 skmem:(r0,rb54,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
 ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
 rcvmss:536 advmss:1448
 cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
 segs_in:3916318 data_segs_out:177279175
 bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
 send 28045.5Mbps lastrcv:7
 pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
 busy:7ms unacked:135 retrans:0/157 rcv_space:14480
 notsent:2085120 minrtt:0.013

Signed-off-by: Eric Dumazet 
---
 include/net/sock.h|  4 ++--
 net/core/filter.c |  4 ++--
 net/core/sock.c   |  9 +
 net/ipv4/tcp.c| 10 +-
 net/ipv4/tcp_bbr.c|  6 +++---
 net/ipv4/tcp_output.c | 19 +++
 net/sched/sch_fq.c| 20 
 7 files changed, 40 insertions(+), 32 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 751549ac0a849144ab0382203ee5c877374523e2..cfaf261936c8787b3a65ce832fd9c871697d00f4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -422,8 +422,8 @@ struct sock {
struct timer_list   sk_timer;
__u32   sk_priority;
__u32   sk_mark;
-   u32 sk_pacing_rate; /* bytes per second */
-   u32 sk_max_pacing_rate;
+   unsigned long   sk_pacing_rate; /* bytes per second */
+   unsigned long   sk_max_pacing_rate;
struct page_fragsk_frag;
netdev_features_t   sk_route_caps;
netdev_features_t   sk_route_nocaps;
diff --git a/net/core/filter.c b/net/core/filter.c
index 4bbc6567fcb818e91617bfa9a2fd7fbebbd129f8..80da21b097b8d05eb7b9fa92afa86762334ac0ae 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3927,8 +3927,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
break;
-   case SO_MAX_PACING_RATE:
-   sk->sk_max_pacing_rate = val;
+   case SO_MAX_PACING_RATE: /* 32bit version */
+   sk->sk_max_pacing_rate = (val == ~0U) ? ~0UL : val;
sk->sk_pacing_rate = min(sk->sk_pacing_rate,
 sk->sk_max_pacing_rate);
break;
diff --git a/net/core/sock.c b/net/core/sock.c
index 7e8796a6a0892efbb7dfce67d12b8062b2d5daa9..fdf9fc7d3f9875f2718575078a0f263674c80b4f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -998,7 +998,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
			cmpxchg(&sk->sk_pacing_status,
SK_PACING_NONE,
SK_PACING_NEEDED);
-   sk->sk_max_pacing_rate = val;
+   sk->sk_max_pacing_rate = (val == ~0U) ? ~0UL : val;
sk->sk_pacing_rate = min(sk->sk_pacing_rate,
 sk->sk_max_pacing_rate);
break;
@@ -1336,7 +1336,8 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 #endif
 
case SO_MAX_PACING_RATE:
-   v.val = sk->sk_max_pacing_rate;
+   /* 32bit version */
+   v.val = min_t(unsigned long, sk->sk_max_pacing_rate, ~0U);
break;
 
case SO_INCOMING_CPU:
@@ -2810,8 +2811,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)
sk->sk_ll_usec  =   sysctl_net_busy_read;
 #endif
 
-   sk->sk_max_pacing_rate = ~0U;
-   sk->sk_pacing_rate = ~0U;
+   sk->sk_max_pacing_rate = ~0UL;
+   sk->sk_pacing_rate = ~0UL;
sk->sk_pacing_shift = 10;
sk->sk_incoming_cpu = -1;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 43ef83b2330e6238a55c9843580a585d87708e0c..b8ba8fa34effac5138aea76b0d0fc2a9f1c05c4f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3111,10 +3111,10 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 {
const struct tcp_sock *tp = tcp_sk(sk); /* iff sk_type == SOCK_STREAM */
const struct inet_connection_sock 

[PATCH net-next 6/7] tcp_bbr: fix typo in bbr_pacing_margin_percent

2018-10-15 Thread Eric Dumazet
From: Neal Cardwell 

There was a typo in this parameter name.

Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_bbr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 33f4358615e6d63b5c98a30484f12ffae66334a2..b88081285fd172444a844b6aec5d038c0f882594 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -129,7 +129,7 @@ static const u32 bbr_probe_rtt_mode_ms = 200;
 static const int bbr_min_tso_rate = 120;
 
/* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck. */
-static const int bbr_pacing_marging_percent = 1;
+static const int bbr_pacing_margin_percent = 1;
 
 /* We use a high_gain value of 2/ln(2) because it's the smallest pacing gain
  * that will allow a smoothly increasing pacing rate that will double each RTT
@@ -214,7 +214,7 @@ static u64 bbr_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain)
rate *= mss;
rate *= gain;
rate >>= BBR_SCALE;
-   rate *= USEC_PER_SEC / 100 * (100 - bbr_pacing_marging_percent);
+   rate *= USEC_PER_SEC / 100 * (100 - bbr_pacing_margin_percent);
return rate >> BW_SCALE;
 }
 
-- 
2.19.0.605.g01d371f741-goog



[PATCH net-next 4/7] net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()

2018-10-15 Thread Eric Dumazet
With the new EDT model, sch_fq no longer has to special-case
TCP pure acks, since their skb->tstamp allows them to be
sent without pacing delay.

Signed-off-by: Eric Dumazet 
---
 net/sched/sch_fq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 3923d14095335df61c270f69e50cb7cbfde4c796..4b1af706896c07e5a0fe6d542dfcd530acdcf8f5 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -444,7 +444,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
}
 
skb = f->head;
-   if (skb && !skb_is_tcp_pure_ack(skb)) {
+   if (skb) {
u64 time_next_packet = max_t(u64, ktime_to_ns(skb->tstamp),
 f->time_next_packet);
 
-- 
2.19.0.605.g01d371f741-goog



[PATCH net-next 5/7] tcp: optimize tcp internal pacing

2018-10-15 Thread Eric Dumazet
When TCP implements its own pacing (when no fq packet scheduler is used),
it arms a high-resolution timer after each packet is sent.

But in many cases (like TCP_RR kind of workloads), this high resolution
timer expires before the application attempts to write the following
packet. This overhead also happens when the flow is ACK clocked and
cwnd limited instead of being limited by the pacing rate.

This leads to extra overhead (a high number of IRQs).

Now that tcp_wstamp_ns is reserved for the pacing timer only
(after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"),
we can set up the timer only when a packet is about to be sent,
and only if tcp_wstamp_ns is in the future.

This leads to a ~10% performance increase in TCP_RR workloads.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_output.c | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5474c9854f252e50cdb1136435417873861d7618..d212e4cbc68902e873afb4a12b43b467ccd6069b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -975,16 +975,6 @@ enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
return HRTIMER_NORESTART;
 }
 
-static void tcp_internal_pacing(struct sock *sk)
-{
-   if (!tcp_needs_internal_pacing(sk))
-   return;
-   hrtimer_start(&tcp_sk(sk)->pacing_timer,
- ns_to_ktime(tcp_sk(sk)->tcp_wstamp_ns),
- HRTIMER_MODE_ABS_PINNED_SOFT);
-   sock_hold(sk);
-}
-
 static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb,
  u64 prior_wstamp)
 {
@@ -1005,8 +995,6 @@ static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb,
/* take into account OS jitter */
len_ns -= min_t(u64, len_ns / 2, credit);
tp->tcp_wstamp_ns += len_ns;
-
-   tcp_internal_pacing(sk);
}
}
	list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
@@ -2186,10 +2174,23 @@ static int tcp_mtu_probe(struct sock *sk)
return -1;
 }
 
-static bool tcp_pacing_check(const struct sock *sk)
+static bool tcp_pacing_check(struct sock *sk)
 {
-   return tcp_needs_internal_pacing(sk) &&
-  hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
+   struct tcp_sock *tp = tcp_sk(sk);
+
+   if (!tcp_needs_internal_pacing(sk))
+   return false;
+
+   if (tp->tcp_wstamp_ns <= tp->tcp_clock_cache)
+   return false;
+
+   if (!hrtimer_is_queued(&tp->pacing_timer)) {
+   hrtimer_start(&tp->pacing_timer,
+ ns_to_ktime(tp->tcp_wstamp_ns),
+ HRTIMER_MODE_ABS_PINNED_SOFT);
+   sock_hold(sk);
+   }
+   return true;
 }
 
 /* TCP Small Queues :
-- 
2.19.0.605.g01d371f741-goog



[PATCH net-next 1/7] tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh

2018-10-15 Thread Eric Dumazet
In EDT design, I made the mistake of using tcp_wstamp_ns
to store the last tcp_clock_ns() sample and to store the
pacing virtual timer.

This causes major regressions at high speed flows.

Introduce tcp_clock_cache to store the last tcp_clock_ns().
This is needed because some arches have a slow high-resolution
kernel time service.

tcp_wstamp_ns is only updated when a packet is sent.

Note that we can remove tcp_mstamp in the future since
tcp_mstamp is essentially tcp_clock_cache/1000, so the
apparent socket size increase is temporary.

Fixes: 9799ccb0e984 ("tcp: add tcp_wstamp_ns socket field")
Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h   | 1 +
 net/ipv4/tcp_output.c | 9 ++---
 net/ipv4/tcp_timer.c  | 2 +-
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 848f5b25e178288ce870637b68a692ab88dc7d4d..8ed77bb4ed8636e9294389a011529fd9a667dce4 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -249,6 +249,7 @@ struct tcp_sock {
u32 tlp_high_seq;   /* snd_nxt at the time of TLP retransmit. */
 
u64 tcp_wstamp_ns;  /* departure time for next sent data packet */
+   u64 tcp_clock_cache; /* cache last tcp_clock_ns() (see tcp_mstamp_refresh()) */
 
 /* RTT measurement */
u64 tcp_mstamp; /* most recent packet received/sent */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 059b67af28b137fb9566eaef370b270fc424bffb..f14df66a0c858dcb22b8924b9691c375eb5fcbc5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -52,9 +52,8 @@ void tcp_mstamp_refresh(struct tcp_sock *tp)
 {
u64 val = tcp_clock_ns();
 
-   /* departure time for next data packet */
-   if (val > tp->tcp_wstamp_ns)
-   tp->tcp_wstamp_ns = val;
+   if (val > tp->tcp_clock_cache)
+   tp->tcp_clock_cache = val;
 
val = div_u64(val, NSEC_PER_USEC);
if (val > tp->tcp_mstamp)
@@ -1050,6 +1049,10 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
if (unlikely(!skb))
return -ENOBUFS;
}
+
+   /* TODO: might take care of jitter here */
+   tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
+
skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
 
inet = inet_sk(sk);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 61023d50cd604d5e19464a32c33b65d29c75c81e..676020663ce80a79341ad1a05352742cc8dd5850 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -360,7 +360,7 @@ static void tcp_probe_timer(struct sock *sk)
 */
start_ts = tcp_skb_timestamp(skb);
if (!start_ts)
-   skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
+   skb->skb_mstamp_ns = tp->tcp_clock_cache;
else if (icsk->icsk_user_timeout &&
 (s32)(tcp_time_stamp(tp) - start_ts) > icsk->icsk_user_timeout)
goto abort;
-- 
2.19.0.605.g01d371f741-goog



[PATCH net-next 0/7] tcp: second round for EDT conversion

2018-10-15 Thread Eric Dumazet
First round of EDT patches left the TCP stack in a non-optimal state.

- High speed flows suffered from loss of performance, addressed
  by the first patch of this series.

- Second patch brings pacing to the current state of networking,
  since we now reach ~100 Gbit on a single TCP flow.

- Third patch implements a mitigation for scheduling delays,
  like the one we did in sch_fq in the past.

- Fourth patch removes one special case in sch_fq for ACK packets.

- Fifth patch removes a serious performance cost for TCP internal
  pacing. We should set up the high resolution timer only if
  really needed.

- Sixth patch fixes a typo in BBR.

- Last patch is one minor change in cdg congestion control.

Neal Cardwell also has a patch series fixing BBR after
EDT adoption.

Eric Dumazet (6):
  tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
  net: extend sk_pacing_rate to unsigned long
  tcp: mitigate scheduling jitter in EDT pacing model
  net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
  tcp: optimize tcp internal pacing
  tcp: cdg: use tcp high resolution clock cache

Neal Cardwell (1):
  tcp_bbr: fix typo in bbr_pacing_margin_percent

 include/linux/tcp.h   |  1 +
 include/net/sock.h|  4 +--
 net/core/filter.c |  4 +--
 net/core/sock.c   |  9 +++---
 net/ipv4/tcp.c| 10 +++---
 net/ipv4/tcp_bbr.c| 10 +++---
 net/ipv4/tcp_cdg.c|  2 +-
 net/ipv4/tcp_output.c | 72 ++-
 net/ipv4/tcp_timer.c  |  2 +-
 net/sched/sch_fq.c| 22 +++--
 10 files changed, 78 insertions(+), 58 deletions(-)

-- 
2.19.0.605.g01d371f741-goog



Re: [PATCH iproute2] macsec: fix off-by-one when parsing attributes

2018-10-15 Thread Stephen Hemminger
On Fri, 12 Oct 2018 17:34:12 +0200
Sabrina Dubroca  wrote:

> I seem to have had a massive brainfart with uses of
> parse_rtattr_nested(). The rtattr* array must have MAX+1 elements, and
> the call to parse_rtattr_nested must have MAX as its bound. Let's fix
> those.
> 
> Fixes: b26fc590ce62 ("ip: add MACsec support")
> Signed-off-by: Sabrina Dubroca 

Applied,
How did it ever work??


Re: [PATCH iproute2] json: make 0xhex handle u64

2018-10-15 Thread Stephen Hemminger
On Fri, 12 Oct 2018 17:34:32 +0200
Sabrina Dubroca  wrote:

> Stephen converted macsec's sci to use 0xhex, but 0xhex handles
> unsigned ints, not 64-bit ints. Thus, the output of the "ip macsec
> show" command is mangled, with half of the SCI replaced with 0s:
> 
> # ip macsec show
> 11: macsec0: [...]
> cipher suite: GCM-AES-128, using ICV length 16
> TXSC: 01560001 on SA 0
> 
> # ip -d link show macsec0
> 11: macsec0@ens3: [...]
> link/ether 52:54:00:12:01:56 brd ff:ff:ff:ff:ff:ff promiscuity 0 
> macsec sci 5254001201560001 [...]
> 
> where TXSC and sci should match.
> 
> Fixes: c0b904de6211 ("macsec: support JSON")
> Signed-off-by: Sabrina Dubroca 

Thanks for finding this. We should add JSON (and macsec) to tests.


Re: [iproute PATCH] bridge: fdb: Fix for missing keywords in non-JSON output

2018-10-15 Thread Stephen Hemminger
On Tue,  9 Oct 2018 14:44:08 +0200
Phil Sutter  wrote:

> While migrating to JSON print library, some keywords were dropped from
> standard output by accident. Add them back to unbreak output parsers.
> 
> Fixes: c7c1a1ef51aea ("bridge: colorize output and use JSON print library")
> Signed-off-by: Phil Sutter 

Good catch. Applied.


Re: BBR and TCP internal pacing causing interrupt storm with pfifo_fast

2018-10-15 Thread Eric Dumazet



On 10/15/2018 07:50 AM, Eric Dumazet wrote:
> On Mon, Oct 15, 2018 at 3:26 AM Gasper Zejn  wrote:
>>
>>
>> I've tried to isolate the issue as best I could. There seems to be an
>> issue if the TCP socket has keepalive set and send queue is not empty
>> and the route goes away.
>>
>> https://github.com/zejn/bbr_pfifo_interrupts_issue
>>
>> Hope this helps,
>> Gasper
> 
> This is awesome Gasper, I will take a look thanks.
> 
> Note that we are about to send a patch series (targeting net-next) to
> polish the EDT patch series that was merged last month for linux-4.20.
> TCP internal pacing is going to be much better performance-wise.
> 

Yeah, I believe that:

Commit c092dd5f4a7f4e4dbbcc8cf2e50b516bf07e432f ("tcp: switch tcp_internal_pacing() to tcp_wstamp_ns")
has incidentally fixed the issue.

That is because it calls tcp_internal_pacing() from
tcp_update_skb_after_send(), which is called only if the packet was
correctly sent by the IP layer.

Before this patch, tcp_internal_pacing() was called from
__tcp_transmit_skb() before we attempted to send the clone,
and the clone could be dropped in the IP layer right away
(lack of a route, for example).

So in case the packet was not sent because of a route problem, the
high-resolution timer would kick in soon after, and the TCP xmit path
would be entered again, triggering this loop problem.

I am going to send the 2nd round of EDT patches, so that you can try David
Miller's net-next tree with all the patches we believe are needed for 4.20.
Once proven to work, we might have to backport the series to 4.18 and 4.19.

Thanks !


Re: [Bug 201423] New: eth0: hw csum failure

2018-10-15 Thread Stephen Hemminger
On Mon, 15 Oct 2018 08:41:47 -0700
Eric Dumazet  wrote:

> On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger
>  wrote:
> >
> >
> >
> > Begin forwarded message:
> >
> > Date: Sun, 14 Oct 2018 10:42:48 +
> > From: bugzilla-dae...@bugzilla.kernel.org
> > To: step...@networkplumber.org
> > Subject: [Bug 201423] New: eth0: hw csum failure
> >
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=201423
> >
> > Bug ID: 201423
> >Summary: eth0: hw csum failure
> >Product: Networking
> >Version: 2.5
> > Kernel Version: 4.19.0-rc7
> >   Hardware: Intel
> > OS: Linux
> >   Tree: Mainline
> > Status: NEW
> >   Severity: normal
> >   Priority: P1
> >  Component: Other
> >   Assignee: step...@networkplumber.org
> >   Reporter: ross...@inwind.it
> > Regression: No
> >
> > I have a P6T DELUXE V2 motherboard and using the sky2 driver for the 
> > ethernet
> > ports. I get the following error message:
> >
> > [  433.727397] eth0: hw csum failure
> > [  433.727406] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19
> > [  433.727406] Hardware name: System manufacturer System Product Name/P6T
> > DELUXE V2, BIOS 120212/22/2010
> > [  433.727407] Call Trace:
> > [  433.727409]  
> > [  433.727415]  dump_stack+0x46/0x5b
> > [  433.727419]  __skb_checksum_complete+0xb0/0xc0
> > [  433.727423]  tcp_v4_rcv+0x528/0xb60
> > [  433.727426]  ? ipt_do_table+0x2d0/0x400
> > [  433.727429]  ip_local_deliver_finish+0x5a/0x110
> > [  433.727430]  ip_local_deliver+0xe1/0xf0
> > [  433.727431]  ? ip_sublist_rcv_finish+0x60/0x60
> > [  433.727432]  ip_rcv+0xca/0xe0
> > [  433.727434]  ? ip_rcv_finish_core.isra.0+0x300/0x300
> > [  433.727436]  __netif_receive_skb_one_core+0x4b/0x70
> > [  433.727438]  netif_receive_skb_internal+0x4e/0x130
> > [  433.727439]  napi_gro_receive+0x6a/0x80
> > [  433.727442]  sky2_poll+0x707/0xd20
> > [  433.727446]  ? rcu_check_callbacks+0x1b4/0x900
> > [  433.727447]  net_rx_action+0x237/0x380
> > [  433.727449]  __do_softirq+0xdc/0x1e0
> > [  433.727452]  irq_exit+0xa9/0xb0
> > [  433.727453]  do_IRQ+0x45/0xc0
> > [  433.727455]  common_interrupt+0xf/0xf
> > [  433.727456]  
> > [  433.727459] RIP: 0010:cpuidle_enter_state+0x124/0x200
> > [  433.727461] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 
> > 8f
> > ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 
> > 89 e1
> > 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
> > [  433.727462] RSP: :c90a3e98 EFLAGS: 0282 ORIG_RAX:
> > ffde
> > [  433.727463] RAX: 880237b1f280 RBX: 0004 RCX:
> > 001f
> > [  433.727464] RDX: 20c49ba5e353f7cf RSI: 2fe419c1 RDI:
> > 
> > [  433.727465] RBP: 880237b263a0 R08: 0714 R09:
> > 00650512105d
> > [  433.727465] R10:  R11: 0342 R12:
> > 0064fc2a8b1c
> > [  433.727466] R13: 0064fc25b35f R14: 0004 R15:
> > 8204af20
> > [  433.727468]  ? cpuidle_enter_state+0x119/0x200
> > [  433.727471]  do_idle+0x1bf/0x200
> > [  433.727473]  cpu_startup_entry+0x6a/0x70
> > [  433.727475]  start_secondary+0x17f/0x1c0
> > [  433.727476]  secondary_startup_64+0xa4/0xb0
> > [  441.662954] eth0: hw csum failure
> > [  441.662959] CPU: 4 PID: 4347 Comm: radeon_cs:0 Not tainted 4.19.0-rc7 #19
> > [  441.662960] Hardware name: System manufacturer System Product Name/P6T
> > DELUXE V2, BIOS 120212/22/2010
> > [  441.662960] Call Trace:
> > [  441.662963]  
> > [  441.662968]  dump_stack+0x46/0x5b
> > [  441.662972]  __skb_checksum_complete+0xb0/0xc0
> > [  441.662975]  tcp_v4_rcv+0x528/0xb60
> > [  441.662979]  ? ipt_do_table+0x2d0/0x400
> > [  441.662981]  ip_local_deliver_finish+0x5a/0x110
> > [  441.662983]  ip_local_deliver+0xe1/0xf0
> > [  441.662985]  ? ip_sublist_rcv_finish+0x60/0x60
> > [  441.662986]  ip_rcv+0xca/0xe0
> > [  441.662988]  ? ip_rcv_finish_core.isra.0+0x300/0x300
> > [  441.662990]  __netif_receive_skb_one_core+0x4b/0x70
> > [  441.662993]  netif_receive_skb_internal+0x4e/0x130
> > [  441.662994]  napi_gro_receive+0x6a/0x80
> > [  441.662998]  sky2_poll+0x707/0xd20
> > [  441.663000]  net_rx_action+0x237/0x380
> > [  441.663002]  __do_softirq+0xdc/0x1e0
> > [  441.663005]  irq_exit+0xa9/0xb0
> > [  441.663007]  do_IRQ+0x45/0xc0
> > [  441.663009]  common_interrupt+0xf/0xf
> > [  441.663010]  
> > [  441.663012] RIP: 0010:merge+0x22/0xb0
> > [  441.663014] Code: c3 31 c0 c3 90 90 90 90 41 56 41 55 41 54 55 48 89 d5 
> > 53
> > 48 89 cb 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 <48> 
> > 85 c9
> > 74 70 48 85 d2 74 6b 49 89 fd 49 89 f6 49 89 e4 eb 14 48
> > [  441.663015] RSP: 0018:c990b988 EFLAGS: 0246 ORIG_RAX:
> > ffde
> > [  441.663017] RAX:  RBX: 88021ab2d408 RCX:
> > 88021ab2d408
> > [  441.663018] 

Re: Fw: [Bug 201423] New: eth0: hw csum failure

2018-10-15 Thread Dave Stevenson
Hi Eric.

On Mon, 15 Oct 2018 at 16:42, Eric Dumazet  wrote:
>
> On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger
>  wrote:
> >
> >
> >
> > Begin forwarded message:
> >
> > Date: Sun, 14 Oct 2018 10:42:48 +
> > From: bugzilla-dae...@bugzilla.kernel.org
> > To: step...@networkplumber.org
> > Subject: [Bug 201423] New: eth0: hw csum failure
> >
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=201423
> >
> > Bug ID: 201423
> >Summary: eth0: hw csum failure
> >Product: Networking
> >Version: 2.5
> > Kernel Version: 4.19.0-rc7
> >   Hardware: Intel
> > OS: Linux
> >   Tree: Mainline
> > Status: NEW
> >   Severity: normal
> >   Priority: P1
> >  Component: Other
> >   Assignee: step...@networkplumber.org
> >   Reporter: ross...@inwind.it
> > Regression: No
> >
> > I have a P6T DELUXE V2 motherboard and using the sky2 driver for the 
> > ethernet
> > ports. I get the following error message:
> >
> > [  433.727397] eth0: hw csum failure
> > [  433.727406] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19
> > [  433.727406] Hardware name: System manufacturer System Product Name/P6T
> > DELUXE V2, BIOS 120212/22/2010
> > [  433.727407] Call Trace:
> > [  433.727409]  
> > [  433.727415]  dump_stack+0x46/0x5b
> > [  433.727419]  __skb_checksum_complete+0xb0/0xc0
> > [  433.727423]  tcp_v4_rcv+0x528/0xb60
> > [  433.727426]  ? ipt_do_table+0x2d0/0x400
> > [  433.727429]  ip_local_deliver_finish+0x5a/0x110
> > [  433.727430]  ip_local_deliver+0xe1/0xf0
> > [  433.727431]  ? ip_sublist_rcv_finish+0x60/0x60
> > [  433.727432]  ip_rcv+0xca/0xe0
> > [  433.727434]  ? ip_rcv_finish_core.isra.0+0x300/0x300
> > [  433.727436]  __netif_receive_skb_one_core+0x4b/0x70
> > [  433.727438]  netif_receive_skb_internal+0x4e/0x130
> > [  433.727439]  napi_gro_receive+0x6a/0x80
> > [  433.727442]  sky2_poll+0x707/0xd20
> > [  433.727446]  ? rcu_check_callbacks+0x1b4/0x900
> > [  433.727447]  net_rx_action+0x237/0x380
> > [  433.727449]  __do_softirq+0xdc/0x1e0
> > [  433.727452]  irq_exit+0xa9/0xb0
> > [  433.727453]  do_IRQ+0x45/0xc0
> > [  433.727455]  common_interrupt+0xf/0xf
> > [  433.727456]  
> > [  433.727459] RIP: 0010:cpuidle_enter_state+0x124/0x200
> > [  433.727461] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 
> > 8f
> > ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 
> > 89 e1
> > 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
> > [  433.727462] RSP: :c90a3e98 EFLAGS: 0282 ORIG_RAX:
> > ffde
> > [  433.727463] RAX: 880237b1f280 RBX: 0004 RCX:
> > 001f
> > [  433.727464] RDX: 20c49ba5e353f7cf RSI: 2fe419c1 RDI:
> > 
> > [  433.727465] RBP: 880237b263a0 R08: 0714 R09:
> > 00650512105d
> > [  433.727465] R10:  R11: 0342 R12:
> > 0064fc2a8b1c
> > [  433.727466] R13: 0064fc25b35f R14: 0004 R15:
> > 8204af20
> > [  433.727468]  ? cpuidle_enter_state+0x119/0x200
> > [  433.727471]  do_idle+0x1bf/0x200
> > [  433.727473]  cpu_startup_entry+0x6a/0x70
> > [  433.727475]  start_secondary+0x17f/0x1c0
> > [  433.727476]  secondary_startup_64+0xa4/0xb0
> > [  441.662954] eth0: hw csum failure
> > [  441.662959] CPU: 4 PID: 4347 Comm: radeon_cs:0 Not tainted 4.19.0-rc7 #19
> > [  441.662960] Hardware name: System manufacturer System Product Name/P6T
> > DELUXE V2, BIOS 120212/22/2010
> > [  441.662960] Call Trace:
> > [  441.662963]  
> > [  441.662968]  dump_stack+0x46/0x5b
> > [  441.662972]  __skb_checksum_complete+0xb0/0xc0
> > [  441.662975]  tcp_v4_rcv+0x528/0xb60
> > [  441.662979]  ? ipt_do_table+0x2d0/0x400
> > [  441.662981]  ip_local_deliver_finish+0x5a/0x110
> > [  441.662983]  ip_local_deliver+0xe1/0xf0
> > [  441.662985]  ? ip_sublist_rcv_finish+0x60/0x60
> > [  441.662986]  ip_rcv+0xca/0xe0
> > [  441.662988]  ? ip_rcv_finish_core.isra.0+0x300/0x300
> > [  441.662990]  __netif_receive_skb_one_core+0x4b/0x70
> > [  441.662993]  netif_receive_skb_internal+0x4e/0x130
> > [  441.662994]  napi_gro_receive+0x6a/0x80
> > [  441.662998]  sky2_poll+0x707/0xd20
> > [  441.663000]  net_rx_action+0x237/0x380
> > [  441.663002]  __do_softirq+0xdc/0x1e0
> > [  441.663005]  irq_exit+0xa9/0xb0
> > [  441.663007]  do_IRQ+0x45/0xc0
> > [  441.663009]  common_interrupt+0xf/0xf
> > [  441.663010]  
> > [  441.663012] RIP: 0010:merge+0x22/0xb0
> > [  441.663014] Code: c3 31 c0 c3 90 90 90 90 41 56 41 55 41 54 55 48 89 d5 
> > 53
> > 48 89 cb 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 <48> 
> > 85 c9
> > 74 70 48 85 d2 74 6b 49 89 fd 49 89 f6 49 89 e4 eb 14 48
> > [  441.663015] RSP: 0018:c990b988 EFLAGS: 0246 ORIG_RAX:
> > ffde
> > [  441.663017] RAX:  RBX: 88021ab2d408 RCX:
> > 88021ab2d408
> > [  

Re: [bpf-next PATCH v2 2/2] bpf: bpftool, add flag to allow non-compat map definitions

2018-10-15 Thread Jakub Kicinski
On Mon, 15 Oct 2018 08:17:53 -0700, John Fastabend wrote:
> Multiple map definition structures exist, and users may have non-zero
> fields in their definitions that are not recognized by bpftool and
> libbpf. The normal behavior is then to fail loading the map. Although
> this is a good default, users may still want to load the map for
> debugging or other reasons. This patch adds a --mapcompat flag that
> can be used to override the default behavior and allow loading the
> map even when it has additional non-zero fields.
> 
> For now the only user is 'bpftool prog'; we can switch over other
> subcommands as needed. The library now exposes an API that consumes
> a flags field, but I kept the original API around as well in case
> users of the API don't want to expose this. The flags field is an
> int in case we need more control over how the API call handles
> errors/features/etc. in the future.
> 
> Signed-off-by: John Fastabend 

No strong opinion on the functionality, but may I be a grump and again
request adding the new option to completions and the man page? :)


Re: [bpf-next PATCH v2 1/2] bpf: bpftool, add support for attaching programs to maps

2018-10-15 Thread Jakub Kicinski
On Mon, 15 Oct 2018 08:17:48 -0700, John Fastabend wrote:
> Sock map/hash introduce support for attaching programs to maps. To
> date I have been doing this with custom tooling, but that is less than
> ideal as we shift to using bpftool as the single CLI for our BPF uses.
> This patch adds new subcommands 'attach' and 'detach' to the 'prog'
> command to attach programs to maps and then detach them.
> 
> Signed-off-by: John Fastabend 

Reviewed-by: Jakub Kicinski 


Re: Bug in MACSec - stops passing traffic after approx 5TB

2018-10-15 Thread Josh Coombs
And confirmed: starting with a high packet number results in a very
short testbed run, 296 packets and then nothing, just as you surmised.
Sorry for the false alarm. Looks like I need to roll my own build of
wpa_supplicant, as the Ubuntu builds don't include the macsec driver;
I haven't tested Gentoo's ebuilds yet to see if they do.

Josh Coombs

On Sun, Oct 14, 2018 at 4:52 PM Josh Coombs  wrote:
>
> On Sun, Oct 14, 2018 at 4:24 PM Sabrina Dubroca  wrote:
> >
> > 2018-10-14, 10:59:31 -0400, Josh Coombs wrote:
> > > I initially mistook this for a traffic control issue, but after
> > > stripping the test beds down to just the MACSec component, I can still
> > > replicate the issue.  After approximately 5TB of transfer / 4 billion
> > > packets over a MACSec link it stops passing traffic.
> >
> > I think you're just hitting packet number exhaustion. After 2^32
> > packets, the packet number would wrap to 0 and start being reused,
> > which breaks the crypto used by macsec. Before this point, you have to
> > add a new SA, and tell the macsec device to switch to it.
>
> I had not considered that; I naively thought that as long as I didn't
> specify a replay window, it'd roll the PN over on its own and life
> would be good. I'll test that theory tomorrow; it should be easy to
> prove out.
>
> > That's why you should be using wpa_supplicant. It will monitor the
> > growth of the packet number, and handle the rekey for you.
>
> Thank you for the heads up, I'll read up on this as well.
>
> Josh C


Re: Fw: [Bug 201423] New: eth0: hw csum failure

2018-10-15 Thread Eric Dumazet
On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger
 wrote:
>
>
>
> Begin forwarded message:
>
> Date: Sun, 14 Oct 2018 10:42:48 +
> From: bugzilla-dae...@bugzilla.kernel.org
> To: step...@networkplumber.org
> Subject: [Bug 201423] New: eth0: hw csum failure
>
>
> https://bugzilla.kernel.org/show_bug.cgi?id=201423
>
> Bug ID: 201423
>Summary: eth0: hw csum failure
>Product: Networking
>Version: 2.5
> Kernel Version: 4.19.0-rc7
>   Hardware: Intel
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: Other
>   Assignee: step...@networkplumber.org
>   Reporter: ross...@inwind.it
> Regression: No
>
> I have a P6T DELUXE V2 motherboard and am using the sky2 driver for the
> Ethernet ports. I get the following error messages:
>
> [  433.727397] eth0: hw csum failure
> [  433.727406] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19
> [  433.727406] Hardware name: System manufacturer System Product Name/P6T
> DELUXE V2, BIOS 1202 12/22/2010
> [  433.727407] Call Trace:
> [  433.727409]  
> [  433.727415]  dump_stack+0x46/0x5b
> [  433.727419]  __skb_checksum_complete+0xb0/0xc0
> [  433.727423]  tcp_v4_rcv+0x528/0xb60
> [  433.727426]  ? ipt_do_table+0x2d0/0x400
> [  433.727429]  ip_local_deliver_finish+0x5a/0x110
> [  433.727430]  ip_local_deliver+0xe1/0xf0
> [  433.727431]  ? ip_sublist_rcv_finish+0x60/0x60
> [  433.727432]  ip_rcv+0xca/0xe0
> [  433.727434]  ? ip_rcv_finish_core.isra.0+0x300/0x300
> [  433.727436]  __netif_receive_skb_one_core+0x4b/0x70
> [  433.727438]  netif_receive_skb_internal+0x4e/0x130
> [  433.727439]  napi_gro_receive+0x6a/0x80
> [  433.727442]  sky2_poll+0x707/0xd20
> [  433.727446]  ? rcu_check_callbacks+0x1b4/0x900
> [  433.727447]  net_rx_action+0x237/0x380
> [  433.727449]  __do_softirq+0xdc/0x1e0
> [  433.727452]  irq_exit+0xa9/0xb0
> [  433.727453]  do_IRQ+0x45/0xc0
> [  433.727455]  common_interrupt+0xf/0xf
> [  433.727456]  
> [  433.727459] RIP: 0010:cpuidle_enter_state+0x124/0x200
> [  433.727461] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 8f
> ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 
> e1
> 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
> [  433.727462] RSP: :c90a3e98 EFLAGS: 0282 ORIG_RAX:
> ffde
> [  433.727463] RAX: 880237b1f280 RBX: 0004 RCX:
> 001f
> [  433.727464] RDX: 20c49ba5e353f7cf RSI: 2fe419c1 RDI:
> 
> [  433.727465] RBP: 880237b263a0 R08: 0714 R09:
> 00650512105d
> [  433.727465] R10:  R11: 0342 R12:
> 0064fc2a8b1c
> [  433.727466] R13: 0064fc25b35f R14: 0004 R15:
> 8204af20
> [  433.727468]  ? cpuidle_enter_state+0x119/0x200
> [  433.727471]  do_idle+0x1bf/0x200
> [  433.727473]  cpu_startup_entry+0x6a/0x70
> [  433.727475]  start_secondary+0x17f/0x1c0
> [  433.727476]  secondary_startup_64+0xa4/0xb0
> [  441.662954] eth0: hw csum failure
> [  441.662959] CPU: 4 PID: 4347 Comm: radeon_cs:0 Not tainted 4.19.0-rc7 #19
> [  441.662960] Hardware name: System manufacturer System Product Name/P6T
> DELUXE V2, BIOS 1202 12/22/2010
> [  441.662960] Call Trace:
> [  441.662963]  
> [  441.662968]  dump_stack+0x46/0x5b
> [  441.662972]  __skb_checksum_complete+0xb0/0xc0
> [  441.662975]  tcp_v4_rcv+0x528/0xb60
> [  441.662979]  ? ipt_do_table+0x2d0/0x400
> [  441.662981]  ip_local_deliver_finish+0x5a/0x110
> [  441.662983]  ip_local_deliver+0xe1/0xf0
> [  441.662985]  ? ip_sublist_rcv_finish+0x60/0x60
> [  441.662986]  ip_rcv+0xca/0xe0
> [  441.662988]  ? ip_rcv_finish_core.isra.0+0x300/0x300
> [  441.662990]  __netif_receive_skb_one_core+0x4b/0x70
> [  441.662993]  netif_receive_skb_internal+0x4e/0x130
> [  441.662994]  napi_gro_receive+0x6a/0x80
> [  441.662998]  sky2_poll+0x707/0xd20
> [  441.663000]  net_rx_action+0x237/0x380
> [  441.663002]  __do_softirq+0xdc/0x1e0
> [  441.663005]  irq_exit+0xa9/0xb0
> [  441.663007]  do_IRQ+0x45/0xc0
> [  441.663009]  common_interrupt+0xf/0xf
> [  441.663010]  
> [  441.663012] RIP: 0010:merge+0x22/0xb0
> [  441.663014] Code: c3 31 c0 c3 90 90 90 90 41 56 41 55 41 54 55 48 89 d5 53
> 48 89 cb 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 <48> 85 
> c9
> 74 70 48 85 d2 74 6b 49 89 fd 49 89 f6 49 89 e4 eb 14 48
> [  441.663015] RSP: 0018:c990b988 EFLAGS: 0246 ORIG_RAX:
> ffde
> [  441.663017] RAX:  RBX: 88021ab2d408 RCX:
> 88021ab2d408
> [  441.663018] RDX: 88021ab2d388 RSI: a021c440 RDI:
> 
> [  441.663019] RBP: 88021ab2d388 R08: 5ecf R09:
> 8500
> [  441.663020] R10: ea000877ec00 R11: 880236803500 R12:
> a021c440
> [  441.663021] R13: 88021ab2d448 R14: 0004 R15:
> 

[bpf-next PATCH v2 2/2] bpf: bpftool, add flag to allow non-compat map definitions

2018-10-15 Thread John Fastabend
Multiple map definition structures exist, and users may have non-zero
fields in their definitions that are not recognized by bpftool and
libbpf. The normal behavior is then to fail loading the map. Although
this is a good default, users may still want to load the map for
debugging or other reasons. This patch adds a --mapcompat flag that
can be used to override the default behavior and allow loading the
map even when it has additional non-zero fields.

For now the only user is 'bpftool prog'; we can switch over other
subcommands as needed. The library now exposes an API that consumes
a flags field, but I kept the original API around as well in case
users of the API don't want to expose this. The flags field is an
int in case we need more control over how the API call handles
errors/features/etc. in the future.

Signed-off-by: John Fastabend 
---
 tools/bpf/bpftool/main.c |7 ++-
 tools/bpf/bpftool/main.h |3 ++-
 tools/bpf/bpftool/prog.c |2 +-
 tools/lib/bpf/bpf.h  |3 +++
 tools/lib/bpf/libbpf.c   |   27 ++-
 tools/lib/bpf/libbpf.h   |2 ++
 6 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
index 79dc3f1..828dde3 100644
--- a/tools/bpf/bpftool/main.c
+++ b/tools/bpf/bpftool/main.c
@@ -55,6 +55,7 @@
 bool pretty_output;
 bool json_output;
 bool show_pinned;
+int bpf_flags;
 struct pinned_obj_table prog_table;
 struct pinned_obj_table map_table;
 
@@ -341,6 +342,7 @@ int main(int argc, char **argv)
{ "pretty", no_argument,NULL,   'p' },
{ "version",no_argument,NULL,   'V' },
{ "bpffs",  no_argument,NULL,   'f' },
+   { "mapcompat",  no_argument,NULL,   'm' },
{ 0 }
};
int opt, ret;
@@ -355,7 +357,7 @@ int main(int argc, char **argv)
hash_init(map_table.table);
 
opterr = 0;
-   while ((opt = getopt_long(argc, argv, "Vhpjf",
+   while ((opt = getopt_long(argc, argv, "Vhpjfm",
  options, NULL)) >= 0) {
switch (opt) {
case 'V':
@@ -379,6 +381,9 @@ int main(int argc, char **argv)
case 'f':
show_pinned = true;
break;
+   case 'm':
+   bpf_flags = MAPS_RELAX_COMPAT;
+   break;
default:
p_err("unrecognized option '%s'", argv[optind - 1]);
if (json_output)
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 40492cd..91fd697 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -74,7 +74,7 @@
 #define HELP_SPEC_PROGRAM  \
"PROG := { id PROG_ID | pinned FILE | tag PROG_TAG }"
 #define HELP_SPEC_OPTIONS  \
-   "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} }"
+   "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} | {-m|--mapcompat}"
 #define HELP_SPEC_MAP  \
"MAP := { id MAP_ID | pinned FILE }"
 
@@ -89,6 +89,7 @@ enum bpf_obj_type {
 extern json_writer_t *json_wtr;
 extern bool json_output;
 extern bool show_pinned;
+extern int bpf_flags;
 extern struct pinned_obj_table prog_table;
 extern struct pinned_obj_table map_table;
 
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 99ab42c..3350289 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -908,7 +908,7 @@ static int do_load(int argc, char **argv)
}
}
 
-   obj = bpf_object__open_xattr(&attr);
+   obj = __bpf_object__open_xattr(&attr, bpf_flags);
if (IS_ERR_OR_NULL(obj)) {
p_err("failed to open object file");
goto err_free_reuse_maps;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 87520a8..69a4d40 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -69,6 +69,9 @@ struct bpf_load_program_attr {
__u32 prog_ifindex;
 };
 
+/* Flags to direct loading requirements */
+#define MAPS_RELAX_COMPAT  0x01
+
 /* Recommend log buffer size */
 #define BPF_LOG_BUF_SIZE (256 * 1024)
 int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 176cf55..bd71efc 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -562,8 +562,9 @@ static int compare_bpf_map(const void *_a, const void *_b)
 }
 
 static int
-bpf_object__init_maps(struct bpf_object *obj)
+bpf_object__init_maps(struct bpf_object *obj, int flags)
 {
+   bool strict = !(flags & MAPS_RELAX_COMPAT);
int i, map_idx, map_def_sz, nr_maps = 0;
Elf_Scn *scn;
Elf_Data *data;
@@ -685,7 +686,8 @@ static int compare_bpf_map(const void *_a, const void *_b)
 

[bpf-next PATCH v2 1/2] bpf: bpftool, add support for attaching programs to maps

2018-10-15 Thread John Fastabend
Sock map/hash introduce support for attaching programs to maps. To
date I have been doing this with custom tooling, but that is less than
ideal as we shift to using bpftool as the single CLI for our BPF uses.
This patch adds new subcommands 'attach' and 'detach' to the 'prog'
command to attach programs to maps and then detach them.

Signed-off-by: John Fastabend 
---
 tools/bpf/bpftool/Documentation/bpftool-prog.rst |   11 ++
 tools/bpf/bpftool/Documentation/bpftool.rst  |2 
 tools/bpf/bpftool/bash-completion/bpftool|   19 
 tools/bpf/bpftool/prog.c |   99 ++
 4 files changed, 128 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
index 64156a1..12c8030 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
@@ -25,6 +25,8 @@ MAP COMMANDS
|  **bpftool** **prog dump jited**  *PROG* [{**file** *FILE* | **opcodes**}]
 |  **bpftool** **prog pin** *PROG* *FILE*
|  **bpftool** **prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
+|   **bpftool** **prog attach** *PROG* *ATTACH_TYPE* *MAP*
+|   **bpftool** **prog detach** *PROG* *ATTACH_TYPE* *MAP*
 |  **bpftool** **prog help**
 |
 |  *MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
@@ -37,6 +39,7 @@ MAP COMMANDS
|  **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** |
|  **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6**
 |  }
+|   *ATTACH_TYPE* := { **msg_verdict** | **skb_verdict** | **skb_parse** }
 
 
 DESCRIPTION
@@ -90,6 +93,14 @@ DESCRIPTION
 
  Note: *FILE* must be located in *bpffs* mount.
 
+**bpftool prog attach** *PROG* *ATTACH_TYPE* *MAP*
+  Attach bpf program *PROG* (with type specified by *ATTACH_TYPE*)
+  to the map *MAP*.
+
+**bpftool prog detach** *PROG* *ATTACH_TYPE* *MAP*
+  Detach bpf program *PROG* (with type specified by *ATTACH_TYPE*)
+  from the map *MAP*.
+
**bpftool prog help**
  Print short help message.
 
diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst b/tools/bpf/bpftool/Documentation/bpftool.rst
index 8dda77d..25c0872 100644
--- a/tools/bpf/bpftool/Documentation/bpftool.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool.rst
@@ -26,7 +26,7 @@ SYNOPSIS
| **pin** | **event_pipe** | **help** }
 
*PROG-COMMANDS* := { **show** | **list** | **dump jited** | **dump xlated** | **pin**
-   | **load** | **help** }
+   | **load** | **attach** | **detach** | **help** }
 
*CGROUP-COMMANDS* := { **show** | **list** | **attach** | **detach** | **help** }
 
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index df1060b..0826519 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -292,6 +292,23 @@ _bpftool()
 fi
 return 0
 ;;
+attach|detach)
+if [[ ${#words[@]} == 7 ]]; then
+COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
+return 0
+fi
+
+if [[ ${#words[@]} == 6 ]]; then
+COMPREPLY=( $( compgen -W "msg_verdict skb_verdict skb_parse" -- "$cur" ) )
+return 0
+fi
+
+if [[ $prev == "$command" ]]; then
+COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
+return 0
+fi
+return 0
+;;
 load)
 local obj
 
@@ -347,7 +364,7 @@ _bpftool()
 ;;
 *)
 [[ $prev == $object ]] && \
-COMPREPLY=( $( compgen -W 'dump help pin load \
+COMPREPLY=( $( compgen -W 'dump help pin attach detach load \
 show list' -- "$cur" ) )
 ;;
 esac
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index b1cd3bc..99ab42c 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -77,6 +77,26 @@
[BPF_PROG_TYPE_FLOW_DISSECTOR]  = "flow_dissector",
 };
 
+static const char * const attach_type_strings[] = {
+   [BPF_SK_SKB_STREAM_PARSER] = "stream_parser",
+   [BPF_SK_SKB_STREAM_VERDICT] = "stream_verdict",
+   [BPF_SK_MSG_VERDICT] = "msg_verdict",
+   [__MAX_BPF_ATTACH_TYPE] = NULL,
+};
+
+enum bpf_attach_type parse_attach_type(const char *str)
+{
+   enum bpf_attach_type type;
+
+   for 
