[PATCH bpf 2/2] bpf: udp: Avoid calling reuseport's bpf_prog from udp_gro

2019-05-31 Thread Martin KaFai Lau
When the commit a6024562ffd7 ("udp: Add GRO functions to UDP socket")
added udp[46]_lib_lookup_skb to the udp_gro code path, it broke
the reuseport_select_sock() assumption that skb->data is pointing
to the transport header.

This patch follows an earlier __udp6_lib_err() fix by
passing a NULL skb to avoid calling the reuseport's bpf_prog.

Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket")
Cc: Tom Herbert 
Signed-off-by: Martin KaFai Lau 
---
 net/ipv4/udp.c | 6 +-
 net/ipv6/udp.c | 2 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8fb250ed53d4..85db0e3d7f3f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -503,7 +503,11 @@ static inline struct sock *__udp4_lib_lookup_skb(struct sk_buff *skb,
 struct sock *udp4_lib_lookup_skb(struct sk_buff *skb,
 __be16 sport, __be16 dport)
 {
-   return __udp4_lib_lookup_skb(skb, sport, dport, &udp_table);
+   const struct iphdr *iph = ip_hdr(skb);
+
+   return __udp4_lib_lookup(dev_net(skb->dev), iph->saddr, sport,
+iph->daddr, dport, inet_iif(skb),
+inet_sdif(skb), &udp_table, NULL);
 }
 EXPORT_SYMBOL_GPL(udp4_lib_lookup_skb);
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 133e6370f89c..4e52c37bb836 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -243,7 +243,7 @@ struct sock *udp6_lib_lookup_skb(struct sk_buff *skb,
 
return __udp6_lib_lookup(dev_net(skb->dev), &iph->saddr, sport,
 &iph->daddr, dport, inet6_iif(skb),
-inet6_sdif(skb), &udp_table, skb);
+inet6_sdif(skb), &udp_table, NULL);
 }
 EXPORT_SYMBOL_GPL(udp6_lib_lookup_skb);
 
-- 
2.17.1



[PATCH bpf 0/2] bpf: udp: A few fixes for running reuseport's bpf_prog for udp lookup

2019-05-31 Thread Martin KaFai Lau
This series has fixes for running reuseport's bpf_prog during udp lookup.
When a reuseport bpf_prog is attached, the common issue is that the reuseport
code path expects skb->data to point to the transport header (udphdr here).
A couple of commits broke this expectation.  The issue is specific
to running the bpf_prog, so the bpf tag is used for this series.

Please refer to the individual commit message for details.

Martin KaFai Lau (2):
  bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_err
  bpf: udp: Avoid calling reuseport's bpf_prog from udp_gro

 net/ipv4/udp.c | 6 +-
 net/ipv6/udp.c | 4 ++--
 2 files changed, 7 insertions(+), 3 deletions(-)

-- 
2.17.1



[PATCH bpf 1/2] bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_err

2019-05-31 Thread Martin KaFai Lau
__udp6_lib_err() may be called when handling an icmpv6 message, for example
the icmpv6 toobig (type=2).  __udp6_lib_lookup() is then called,
which may call reuseport_select_sock().  reuseport_select_sock() will
call into a bpf_prog (if there is one).

reuseport_select_sock() expects skb->data to point to the
transport header (udphdr in this case).  For example, run_bpf_filter()
pulls the transport header.
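
For reference, run_bpf_filter() in net/core/sock_reuseport.c does roughly
the following (paraphrased sketch, not a verbatim copy).  It only works if
skb->data is already at the udphdr, so that hdr_len covers exactly the
transport header:

	/* temporarily advance data past the protocol header */
	if (!pskb_pull(skb, hdr_len))
		return NULL;
	index = bpf_prog_run_save_cb(prog, skb);
	__skb_push(skb, hdr_len);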

However, in the __udp6_lib_err() path, the skb->data is pointing to the
ipv6hdr instead of the udphdr.

One option is to pull and push the ipv6hdr in __udp6_lib_err().
Instead of doing this, this patch follows what the original
commit 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
did in IPv4, which was to pass a NULL skb pointer to
reuseport_select_sock().

Fixes: 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
Cc: Craig Gallek 
Signed-off-by: Martin KaFai Lau 
---
 net/ipv6/udp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 07fa579dfb96..133e6370f89c 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -515,7 +515,7 @@ int __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
struct net *net = dev_net(skb->dev);
 
sk = __udp6_lib_lookup(net, daddr, uh->dest, saddr, uh->source,
-  inet6_iif(skb), inet6_sdif(skb), udptable, skb);
+  inet6_iif(skb), inet6_sdif(skb), udptable, NULL);
if (!sk) {
/* No socket for error: try tunnels before discarding */
sk = ERR_PTR(-ENOENT);
-- 
2.17.1



Re: [PATCH bpf-next v7 2/4] tools, bpf: add tcp.h to tools/uapi

2021-01-13 Thread Martin KaFai Lau
On Tue, Jan 12, 2021 at 02:38:45PM -0800, Stanislav Fomichev wrote:
> Next test is using struct tcp_zerocopy_receive which was added in v4.18.
Instead of "Next", it is the test in the previous patch.

Instead of having patch 2 fix patch 1, it makes more sense to merge
the testing/selftests/bpf/* changes from patch 1 into this patch.
With this change, for patches 1 and 2:

Acked-by: Martin KaFai Lau


Re: [PATCH bpf-next v7 3/4] bpf: try to avoid kzalloc in cgroup/{s,g}etsockopt

2021-01-13 Thread Martin KaFai Lau
On Tue, Jan 12, 2021 at 02:38:46PM -0800, Stanislav Fomichev wrote:
> When we attach a bpf program to cgroup/getsockopt any other getsockopt()
> syscall starts incurring kzalloc/kfree cost.
> 
> Let add a small buffer on the stack and use it for small (majority)
> {s,g}etsockopt values. The buffer is small enough to fit into
> the cache line and cover the majority of simple options (most
> of them are 4 byte ints).
> 
> It seems natural to do the same for setsockopt, but it's a bit more
> involved when the BPF program modifies the data (where we have to
> kmalloc). The assumption is that for the majority of setsockopt
> calls (which are doing pure BPF options or apply policy) this
> will bring some benefit as well.
> 
> Without this patch (we remove about 1% __kmalloc):
>  3.38% 0.07%  tcp_mmap  [kernel.kallsyms]  [k] 
> __cgroup_bpf_run_filter_getsockopt
> |
>  --3.30%--__cgroup_bpf_run_filter_getsockopt
>|
>     --0.81%--__kmalloc
> 
> Signed-off-by: Stanislav Fomichev 
> Cc: Martin KaFai Lau 
> Cc: Song Liu 
> ---
>  include/linux/filter.h |  5 
>  kernel/bpf/cgroup.c| 52 --
>  2 files changed, 50 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 29c27656165b..8739f1d4cac4 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1281,6 +1281,11 @@ struct bpf_sysctl_kern {
>   u64 tmp_reg;
>  };
>  
> +#define BPF_SOCKOPT_KERN_BUF_SIZE32
> +struct bpf_sockopt_buf {
> + u8  data[BPF_SOCKOPT_KERN_BUF_SIZE];
> +};
> +
>  struct bpf_sockopt_kern {
>   struct sock *sk;
>   u8  *optval;
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 416e7738981b..dbeef7afbbf9 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -1298,7 +1298,8 @@ static bool __cgroup_bpf_prog_array_is_empty(struct 
> cgroup *cgrp,
>   return empty;
>  }
>  
> -static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> +static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen,
> +  struct bpf_sockopt_buf *buf)
>  {
>   if (unlikely(max_optlen < 0))
>   return -EINVAL;
> @@ -1310,6 +1311,15 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern 
> *ctx, int max_optlen)
>   max_optlen = PAGE_SIZE;
>   }
>  
> + if (max_optlen <= sizeof(buf->data)) {
> + /* When the optval fits into BPF_SOCKOPT_KERN_BUF_SIZE
> +  * bytes avoid the cost of kzalloc.
> +  */
> + ctx->optval = buf->data;
> + ctx->optval_end = ctx->optval + max_optlen;
> + return max_optlen;
> + }
> +
>   ctx->optval = kzalloc(max_optlen, GFP_USER);
>   if (!ctx->optval)
>   return -ENOMEM;
> @@ -1319,16 +1329,26 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern 
> *ctx, int max_optlen)
>   return max_optlen;
>  }
>  
> -static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
> +static void sockopt_free_buf(struct bpf_sockopt_kern *ctx,
> +  struct bpf_sockopt_buf *buf)
>  {
> + if (ctx->optval == buf->data)
> + return;
>   kfree(ctx->optval);
>  }
>  
> +static bool sockopt_buf_allocated(struct bpf_sockopt_kern *ctx,
> +   struct bpf_sockopt_buf *buf)
> +{
> + return ctx->optval != buf->data;
> +}
> +
>  int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
>  int *optname, char __user *optval,
>  int *optlen, char **kernel_optval)
>  {
>   struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> + struct bpf_sockopt_buf buf = {};
>   struct bpf_sockopt_kern ctx = {
>   .sk = sk,
>   .level = *level,
> @@ -1350,7 +1370,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, 
> int *level,
>*/
>   max_optlen = max_t(int, 16, *optlen);
>  
> - max_optlen = sockopt_alloc_buf(&ctx, max_optlen);
> + max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
>   if (max_optlen < 0)
>   return max_optlen;
>  
> @@ -1390,14 +1410,31 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock 
> *sk, int *level,
>*/
>   if (ctx.optlen != 0) {
>   *optlen = ctx.optlen;
> -   

Re: More flexible BPF socket inet_lookup hooking after listening sockets are dispatched

2021-01-21 Thread Martin KaFai Lau
On Thu, Jan 21, 2021 at 09:40:19PM +0100, Shanti Lombard wrote:
> Le 2021-01-21 12:14, Jakub Sitnicki a écrit :
> > On Wed, Jan 20, 2021 at 10:06 PM CET, Alexei Starovoitov wrote:
> > 
> > There is also documentation in the kernel:
> > 
> > https://www.kernel.org/doc/html/latest/bpf/prog_sk_lookup.html
> > 
> 
> Thank you, I saw it, it's well written and very much explains it all.
> 
> > 
> > Existing hook is placed before regular listening/unconnected socket
> > lookup to prevent port hijacking on the unprivileged range.
> > 
> 
> Yes, from the point of view of the BPF program. However from the point of
> view of a legitimate service listening on a port that might be blocked by
> the BPF program, BPF is actually hijacking a port bind.
> 
> That being said, if you install the BPF filter, you should know what you are
> doing.
> 
> > > > The suggestion above would work for my use case, but there is another
> > > > possibility to make the same use cases possible : implement in
> > > > BPF (or
> > > > allow BPF to call) the C and E steps above so the BPF program can
> > > > supplant the kernel behavior. I find this solution less elegant
> > > > and it
> > > > might not work well in case there are multiple inet_lookup BPF
> > > > programs
> > > > installed.
> > 
> > Having a BPF helper available to BPF sk_lookup programs that looks up a
> > socket by packet 4-tuple and netns ID in tcp/udp hashtables sounds
> > reasonable to me. You gain the flexibility that you describe without
> > adding code on the hot path.
Agree that a helper to look up the inet_hash is probably a better way.
There are some existing lookup helper examples, as you also pointed out.

I would avoid adding new hooks doing the same thing: the same bpf prog
would be called multiple times, the bpf running ctx would have to be
initialized multiple times, etc.
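
Just to make the idea concrete, here is a rough sketch of what such a prog
could look like, assuming a 4-tuple lookup helper (bpf_sk_lookup_tcp() here)
were made callable from the sk_lookup prog type as discussed.  The names and
the backend port are made up for illustration only:

---8<---
#include <linux/bpf.h>
#include <linux/in.h>
#include <sys/socket.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

SEC("sk_lookup")
int lookup_in_main_hash(struct bpf_sk_lookup *ctx)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	if (ctx->family != AF_INET || ctx->protocol != IPPROTO_TCP)
		return SK_PASS;

	/* Redo the regular listener lookup against a (hypothetical)
	 * backend port instead of the packet's original dport.
	 */
	tuple.ipv4.daddr = ctx->local_ip4;
	tuple.ipv4.dport = bpf_htons(8080);

	sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof(tuple.ipv4),
			       BPF_F_CURRENT_NETNS, 0);
	if (!sk)
		return SK_PASS;

	bpf_sk_assign(ctx, sk, 0);
	bpf_sk_release(sk);
	return SK_PASS;
}

char _license[] SEC("license") = "GPL";
---8<---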

> 
> True, if you consider that hot path should not be slowed down. It makes
> sense. However, for me, it seems the implementation would be more difficult.
> 
> Looking at existing BPF helpers 
>  > I found bpf_sk_lookup_tcp and bpf_sk_lookup_udp that should yield a socket
> from a matching tuple and netns. If that's true and usable from within BPF
> sk_lookup then it's just a matter of implementing it and the kernel is
> already ready for such use cases.
> 
> Shanti


Re: [PATCH bpf-next 2/2] selftests/bpf: verify that rebinding to port < 1024 from BPF works

2021-01-21 Thread Martin KaFai Lau
On Wed, Jan 20, 2021 at 05:22:41PM -0800, Stanislav Fomichev wrote:
> BPF rewrites from 111 to 111, but it still should mark the port as
> "changed".
> We also verify that if port isn't touched by BPF, it's still prohibited.
> 
> Signed-off-by: Stanislav Fomichev 
> ---
>  .../selftests/bpf/prog_tests/bind_perm.c  | 88 +++
>  tools/testing/selftests/bpf/progs/bind_perm.c | 36 
>  2 files changed, 124 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/bind_perm.c
>  create mode 100644 tools/testing/selftests/bpf/progs/bind_perm.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/bind_perm.c 
> b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> new file mode 100644
> index ..840a04ac9042
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> @@ -0,0 +1,88 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include 
> +#include "bind_perm.skel.h"
> +
> +#include 
> +#include 
> +#include 
> +
> +static int duration;
> +
> +void try_bind(int port, int expected_errno)
> +{
> + struct sockaddr_in sin = {};
> + int fd = -1;
> +
> + fd = socket(AF_INET, SOCK_STREAM, 0);
> + if (CHECK(fd < 0, "fd", "errno %d", errno))
> + goto close_socket;
> +
> + sin.sin_family = AF_INET;
> + sin.sin_port = htons(port);
> +
> + errno = 0;
> + bind(fd, (struct sockaddr *)&sin, sizeof(sin));
> + CHECK(errno != expected_errno, "bind", "errno %d, expected %d",
> +   errno, expected_errno);
> +
> +close_socket:
> + if (fd >= 0)
> + close(fd);
> +}
> +
> +void cap_net_bind_service(cap_flag_value_t flag)
> +{
> + const cap_value_t cap_net_bind_service = CAP_NET_BIND_SERVICE;
> + cap_t caps;
> +
> + caps = cap_get_proc();
> + if (CHECK(!caps, "cap_get_proc", "errno %d", errno))
> + goto free_caps;
> +
> + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> +CAP_CLEAR),
> +   "cap_set_flag", "errno %d", errno))
> + goto free_caps;
> +
> + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> +CAP_CLEAR),
> +   "cap_set_flag", "errno %d", errno))
> + goto free_caps;
> +
> + if (CHECK(cap_set_proc(caps), "cap_set_proc", "errno %d", errno))
> + goto free_caps;
> +
> +free_caps:
> + if (CHECK(cap_free(caps), "cap_free", "errno %d", errno))
> + goto free_caps;
> +}
> +
> +void test_bind_perm(void)
> +{
> + struct bind_perm *skel;
> + int cgroup_fd;
> +
> + cgroup_fd = test__join_cgroup("/bind_perm");
> + if (CHECK(cgroup_fd < 0, "cg-join", "errno %d", errno))
> + return;
> +
> + skel = bind_perm__open_and_load();
> + if (CHECK(!skel, "skel-load", "errno %d", errno))
> + goto close_cgroup_fd;
> +
> + skel->links.bind_v4_prog = 
> bpf_program__attach_cgroup(skel->progs.bind_v4_prog, cgroup_fd);
> + if (CHECK(IS_ERR(skel->links.bind_v4_prog),
> +   "cg-attach", "bind4 %ld",
> +   PTR_ERR(skel->links.bind_v4_prog)))
> + goto close_skeleton;
> +
> + cap_net_bind_service(CAP_CLEAR);
> + try_bind(110, EACCES);
> + try_bind(111, 0);
> + cap_net_bind_service(CAP_SET);
> +
> +close_skeleton:
> + bind_perm__destroy(skel);
> +close_cgroup_fd:
> + close(cgroup_fd);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/bind_perm.c 
> b/tools/testing/selftests/bpf/progs/bind_perm.c
> new file mode 100644
> index ..2194587ec806
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bind_perm.c
> @@ -0,0 +1,36 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +SEC("cgroup/bind4")
> +int bind_v4_prog(struct bpf_sock_addr *ctx)
> +{
> + struct bpf_sock *sk;
> + __u32 user_ip4;
> + __u16 user_port;
> +
> + sk = ctx->sk;
> + if (!sk)
> + return 0;
> +
> + if (sk->family != AF_INET)
> + return 0;
> +
> + if (ctx->type != SOCK_STREAM)
> + return 0;
> +
> + /* Rewriting to the same value should still cause
> +  * permission check to be bypassed.
> +  */
> + if (ctx->user_port == bpf_htons(111))
> + ctx->user_port = bpf_htons(111);
iiuc, this overwrite is essentially the way to ensure the bind
will succeed (overriding CAP_NET_BIND_SERVICE in this particular case?).

It seems okay if we consider that most of the use cases are rewriting
to a different port.

However, it is quite unintuitive for the bpf prog to overwrite with
the same user_port just to ensure this port can be bound successfully
later.

Is user_port the only case?  How about other fields in bpf_sock_addr?

> +
> + return 1;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> -- 
> 2.30.0.284.gd98b1dd5eaa7-goog
> 


Re: [PATCH bpf-next 2/2] selftests/bpf: verify that rebinding to port < 1024 from BPF works

2021-01-21 Thread Martin KaFai Lau
On Thu, Jan 21, 2021 at 02:57:44PM -0800, s...@google.com wrote:
> On 01/21, Martin KaFai Lau wrote:
> > On Wed, Jan 20, 2021 at 05:22:41PM -0800, Stanislav Fomichev wrote:
> > > BPF rewrites from 111 to 111, but it still should mark the port as
> > > "changed".
> > > We also verify that if port isn't touched by BPF, it's still prohibited.
> > >
> > > Signed-off-by: Stanislav Fomichev 
> > > ---
> > >  .../selftests/bpf/prog_tests/bind_perm.c  | 88 +++
> > >  tools/testing/selftests/bpf/progs/bind_perm.c | 36 
> > >  2 files changed, 124 insertions(+)
> > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > >  create mode 100644 tools/testing/selftests/bpf/progs/bind_perm.c
> > >
> > > diff --git a/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > new file mode 100644
> > > index ..840a04ac9042
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > @@ -0,0 +1,88 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include 
> > > +#include "bind_perm.skel.h"
> > > +
> > > +#include 
> > > +#include 
> > > +#include 
> > > +
> > > +static int duration;
> > > +
> > > +void try_bind(int port, int expected_errno)
> > > +{
> > > + struct sockaddr_in sin = {};
> > > + int fd = -1;
> > > +
> > > + fd = socket(AF_INET, SOCK_STREAM, 0);
> > > + if (CHECK(fd < 0, "fd", "errno %d", errno))
> > > + goto close_socket;
> > > +
> > > + sin.sin_family = AF_INET;
> > > + sin.sin_port = htons(port);
> > > +
> > > + errno = 0;
> > > + bind(fd, (struct sockaddr *)&sin, sizeof(sin));
> > > + CHECK(errno != expected_errno, "bind", "errno %d, expected %d",
> > > +   errno, expected_errno);
> > > +
> > > +close_socket:
> > > + if (fd >= 0)
> > > + close(fd);
> > > +}
> > > +
> > > +void cap_net_bind_service(cap_flag_value_t flag)
> > > +{
> > > + const cap_value_t cap_net_bind_service = CAP_NET_BIND_SERVICE;
> > > + cap_t caps;
> > > +
> > > + caps = cap_get_proc();
> > > + if (CHECK(!caps, "cap_get_proc", "errno %d", errno))
> > > + goto free_caps;
> > > +
> > > + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> > > +CAP_CLEAR),
> > > +   "cap_set_flag", "errno %d", errno))
> > > + goto free_caps;
> > > +
> > > + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> > > +CAP_CLEAR),
> > > +   "cap_set_flag", "errno %d", errno))
> > > + goto free_caps;
> > > +
> > > + if (CHECK(cap_set_proc(caps), "cap_set_proc", "errno %d", errno))
> > > + goto free_caps;
> > > +
> > > +free_caps:
> > > + if (CHECK(cap_free(caps), "cap_free", "errno %d", errno))
> > > + goto free_caps;
> > > +}
> > > +
> > > +void test_bind_perm(void)
> > > +{
> > > + struct bind_perm *skel;
> > > + int cgroup_fd;
> > > +
> > > + cgroup_fd = test__join_cgroup("/bind_perm");
> > > + if (CHECK(cgroup_fd < 0, "cg-join", "errno %d", errno))
> > > + return;
> > > +
> > > + skel = bind_perm__open_and_load();
> > > + if (CHECK(!skel, "skel-load", "errno %d", errno))
> > > + goto close_cgroup_fd;
> > > +
> > > + skel->links.bind_v4_prog =
> > bpf_program__attach_cgroup(skel->progs.bind_v4_prog, cgroup_fd);
> > > + if (CHECK(IS_ERR(skel->links.bind_v4_prog),
> > > +   "cg-attach", "bind4 %ld",
> > > +   PTR_ERR(skel->links.bind_v4_prog)))
> > > + goto close_skeleton;
> > > +
> > > + cap_net_bind_service(CAP_CLEAR);
> > > + try_bind(110, EACCES);
> > > + try_bind(111, 0);
> > > + cap_net_bind_service(CAP_SET);
> > > +
> > > +close_skeleton:
> > > + bind_perm__destroy(skel);
> > > +close_cgroup_fd:
&g

Re: [PATCH bpf-next 2/2] selftests/bpf: verify that rebinding to port < 1024 from BPF works

2021-01-21 Thread Martin KaFai Lau
On Thu, Jan 21, 2021 at 04:30:08PM -0800, s...@google.com wrote:
> On 01/21, Martin KaFai Lau wrote:
> > On Thu, Jan 21, 2021 at 02:57:44PM -0800, s...@google.com wrote:
> > > On 01/21, Martin KaFai Lau wrote:
> > > > On Wed, Jan 20, 2021 at 05:22:41PM -0800, Stanislav Fomichev wrote:
> > > > > BPF rewrites from 111 to 111, but it still should mark the port as
> > > > > "changed".
> > > > > We also verify that if port isn't touched by BPF, it's still
> > prohibited.
> > > > >
> > > > > Signed-off-by: Stanislav Fomichev 
> > > > > ---
> > > > >  .../selftests/bpf/prog_tests/bind_perm.c  | 88
> > +++
> > > > >  tools/testing/selftests/bpf/progs/bind_perm.c | 36 
> > > > >  2 files changed, 124 insertions(+)
> > > > >  create mode 100644
> > tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > >  create mode 100644 tools/testing/selftests/bpf/progs/bind_perm.c
> > > > >
> > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > > new file mode 100644
> > > > > index ..840a04ac9042
> > > > > --- /dev/null
> > > > > +++ b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > > @@ -0,0 +1,88 @@
> > > > > +// SPDX-License-Identifier: GPL-2.0
> > > > > +#include 
> > > > > +#include "bind_perm.skel.h"
> > > > > +
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +
> > > > > +static int duration;
> > > > > +
> > > > > +void try_bind(int port, int expected_errno)
> > > > > +{
> > > > > + struct sockaddr_in sin = {};
> > > > > + int fd = -1;
> > > > > +
> > > > > + fd = socket(AF_INET, SOCK_STREAM, 0);
> > > > > + if (CHECK(fd < 0, "fd", "errno %d", errno))
> > > > > + goto close_socket;
> > > > > +
> > > > > + sin.sin_family = AF_INET;
> > > > > + sin.sin_port = htons(port);
> > > > > +
> > > > > + errno = 0;
> > > > > + bind(fd, (struct sockaddr *)&sin, sizeof(sin));
> > > > > + CHECK(errno != expected_errno, "bind", "errno %d, expected %d",
> > > > > +   errno, expected_errno);
> > > > > +
> > > > > +close_socket:
> > > > > + if (fd >= 0)
> > > > > + close(fd);
> > > > > +}
> > > > > +
> > > > > +void cap_net_bind_service(cap_flag_value_t flag)
> > > > > +{
> > > > > + const cap_value_t cap_net_bind_service = CAP_NET_BIND_SERVICE;
> > > > > + cap_t caps;
> > > > > +
> > > > > + caps = cap_get_proc();
> > > > > + if (CHECK(!caps, "cap_get_proc", "errno %d", errno))
> > > > > + goto free_caps;
> > > > > +
> > > > > + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1,
> > &cap_net_bind_service,
> > > > > +CAP_CLEAR),
> > > > > +   "cap_set_flag", "errno %d", errno))
> > > > > + goto free_caps;
> > > > > +
> > > > > + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1,
> > &cap_net_bind_service,
> > > > > +CAP_CLEAR),
> > > > > +   "cap_set_flag", "errno %d", errno))
> > > > > + goto free_caps;
> > > > > +
> > > > > + if (CHECK(cap_set_proc(caps), "cap_set_proc", "errno %d", 
> > > > > errno))
> > > > > + goto free_caps;
> > > > > +
> > > > > +free_caps:
> > > > > + if (CHECK(cap_free(caps), "cap_free", "errno %d", errno))
> > > > > + goto free_caps;
> > > > > +}
> > > > > +
> > > > > +void test_bind_perm(void)
> > > > > +{
> > > > > + struct bind_perm *skel;
>

Re: [PATCH bpf-next 2/2] selftests/bpf: verify that rebinding to port < 1024 from BPF works

2021-01-22 Thread Martin KaFai Lau
On Fri, Jan 22, 2021 at 08:16:40AM -0800, s...@google.com wrote:
> On 01/21, Martin KaFai Lau wrote:
> > On Thu, Jan 21, 2021 at 04:30:08PM -0800, s...@google.com wrote:
> > > On 01/21, Martin KaFai Lau wrote:
> > > > On Thu, Jan 21, 2021 at 02:57:44PM -0800, s...@google.com wrote:
> > > > > On 01/21, Martin KaFai Lau wrote:
> > > > > > On Wed, Jan 20, 2021 at 05:22:41PM -0800, Stanislav Fomichev
> > wrote:
> > > > > > > BPF rewrites from 111 to 111, but it still should mark the
> > port as
> > > > > > > "changed".
> > > > > > > We also verify that if port isn't touched by BPF, it's still
> > > > prohibited.
> > > > > > >
> > > > > > > Signed-off-by: Stanislav Fomichev 
> > > > > > > ---
> > > > > > >  .../selftests/bpf/prog_tests/bind_perm.c  | 88
> > > > +++
> > > > > > >  tools/testing/selftests/bpf/progs/bind_perm.c | 36 
> > > > > > >  2 files changed, 124 insertions(+)
> > > > > > >  create mode 100644
> > > > tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > > > >  create mode 100644
> > tools/testing/selftests/bpf/progs/bind_perm.c
> > > > > > >
> > > > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > > > b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > > > > new file mode 100644
> > > > > > > index ..840a04ac9042
> > > > > > > --- /dev/null
> > > > > > > +++ b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > > > > @@ -0,0 +1,88 @@
> > > > > > > +// SPDX-License-Identifier: GPL-2.0
> > > > > > > +#include 
> > > > > > > +#include "bind_perm.skel.h"
> > > > > > > +
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +
> > > > > > > +static int duration;
> > > > > > > +
> > > > > > > +void try_bind(int port, int expected_errno)
> > > > > > > +{
> > > > > > > + struct sockaddr_in sin = {};
> > > > > > > + int fd = -1;
> > > > > > > +
> > > > > > > + fd = socket(AF_INET, SOCK_STREAM, 0);
> > > > > > > + if (CHECK(fd < 0, "fd", "errno %d", errno))
> > > > > > > + goto close_socket;
> > > > > > > +
> > > > > > > + sin.sin_family = AF_INET;
> > > > > > > + sin.sin_port = htons(port);
> > > > > > > +
> > > > > > > + errno = 0;
> > > > > > > + bind(fd, (struct sockaddr *)&sin, sizeof(sin));
> > > > > > > + CHECK(errno != expected_errno, "bind", "errno %d, expected
> > %d",
> > > > > > > +   errno, expected_errno);
> > > > > > > +
> > > > > > > +close_socket:
> > > > > > > + if (fd >= 0)
> > > > > > > + close(fd);
> > > > > > > +}
> > > > > > > +
> > > > > > > +void cap_net_bind_service(cap_flag_value_t flag)
> > > > > > > +{
> > > > > > > + const cap_value_t cap_net_bind_service = CAP_NET_BIND_SERVICE;
> > > > > > > + cap_t caps;
> > > > > > > +
> > > > > > > + caps = cap_get_proc();
> > > > > > > + if (CHECK(!caps, "cap_get_proc", "errno %d", errno))
> > > > > > > + goto free_caps;
> > > > > > > +
> > > > > > > + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1,
> > > > &cap_net_bind_service,
> > > > > > > +CAP_CLEAR),
> > > > > > > +   "cap_set_flag", "errno %d", errno))
> > > > > > > + goto free_caps;
> > > > > > > +
> > > > > > > + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1,
> > > > &cap_net_bind_service,
> > > > > > > +CAP_CLE

Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.

2020-12-09 Thread Martin KaFai Lau
On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:
> This patch renames reuseport_select_sock() to __reuseport_select_sock() and
> adds two wrapper function of it to pass the migration type defined in the
> previous commit.
> 
>   reuseport_select_sock  : BPF_SK_REUSEPORT_MIGRATE_NO
>   reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST
> 
> As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
> requests at receiving the final ACK or sending a SYN+ACK. Therefore, this
> patch also changes the code to call reuseport_select_migrated_sock() even
> if the listening socket is TCP_CLOSE. If we can pick out a listening socket
> from the reuseport group, we rewrite request_sock.rsk_listener and resume
> processing the request.
> 
> Reviewed-by: Benjamin Herrenschmidt 
> Signed-off-by: Kuniyuki Iwashima 
> ---
>  include/net/inet_connection_sock.h | 12 +++
>  include/net/request_sock.h | 13 
>  include/net/sock_reuseport.h   |  8 +++
>  net/core/sock_reuseport.c  | 34 --
>  net/ipv4/inet_connection_sock.c| 13 ++--
>  net/ipv4/tcp_ipv4.c|  9 ++--
>  net/ipv6/tcp_ipv6.c|  9 ++--
>  7 files changed, 81 insertions(+), 17 deletions(-)
> 
> diff --git a/include/net/inet_connection_sock.h 
> b/include/net/inet_connection_sock.h
> index 2ea2d743f8fc..1e0958f5eb21 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct 
> sock *sk)
>   reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
>  }
>  
> +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> +  struct sock *nsk,
> +  struct request_sock *req)
> +{
> + reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> +  &inet_csk(nsk)->icsk_accept_queue,
> +  req);
> + sock_put(sk);
not sure if it is safe to do the sock_put() here.
IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
to req->rsk_listener such that sock_hold(req->rsk_listener) is
safe because its sk_refcnt is not zero.

> + sock_hold(nsk);
> + req->rsk_listener = nsk;
> +}
> +

[ ... ]

> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 361efe55b1ad..e71653c6eae2 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -743,8 +743,17 @@ static void reqsk_timer_handler(struct timer_list *t)
>   struct request_sock_queue *queue = &icsk->icsk_accept_queue;
>   int max_syn_ack_retries, qlen, expire = 0, resend = 0;
>  
> - if (inet_sk_state_load(sk_listener) != TCP_LISTEN)
> - goto drop;
> + if (inet_sk_state_load(sk_listener) != TCP_LISTEN) {
> + sk_listener = reuseport_select_migrated_sock(sk_listener,
> +  
> req_to_sk(req)->sk_hash, NULL);
> + if (!sk_listener) {
> + sk_listener = req->rsk_listener;
> + goto drop;
> + }
> + inet_csk_reqsk_queue_migrated(req->rsk_listener, sk_listener, 
> req);
> + icsk = inet_csk(sk_listener);
> + queue = &icsk->icsk_accept_queue;
> + }
>  
>   max_syn_ack_retries = icsk->icsk_syn_retries ? : 
> net->ipv4.sysctl_tcp_synack_retries;
>   /* Normally all the openreqs are young and become mature
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index e4b31e70bd30..9a9aa27c6069 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1973,8 +1973,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
>   goto csum_error;
>   }
>   if (unlikely(sk->sk_state != TCP_LISTEN)) {
> - inet_csk_reqsk_queue_drop_and_put(sk, req);
> - goto lookup;
> + nsk = reuseport_select_migrated_sock(sk, 
> req_to_sk(req)->sk_hash, skb);
> + if (!nsk) {
> + inet_csk_reqsk_queue_drop_and_put(sk, req);
> + goto lookup;
> + }
> + inet_csk_reqsk_queue_migrated(sk, nsk, req);
> + sk = nsk;
>   }
>   /* We own a reference on the listener, increase it again
>* as we might lose it too soon.
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 992cbf3eb9e3..ff11f3c0cb96 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1635,8 +1635,13 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
> *skb)
>   goto csum_error;
>   }
>   if (unlikely(sk->sk_state != TCP_LISTEN)) {
> - inet_csk_reqsk_queue_drop_an

Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

2020-12-09 Thread Martin KaFai Lau
On Thu, Dec 10, 2020 at 01:57:19AM +0900, Kuniyuki Iwashima wrote:
[ ... ]

> > > > I think it is a bit complex to pass the new listener from
> > > > reuseport_detach_sock() to inet_csk_listen_stop().
> > > > 
> > > > __tcp_close/tcp_disconnect/tcp_abort
> > > >  |-tcp_set_state
> > > >  |  |-unhash
> > > >  | |-reuseport_detach_sock (return nsk)
> > > >  |-inet_csk_listen_stop
> > > Picking the new listener does not have to be done in
> > > reuseport_detach_sock().
> > > 
> > > IIUC, it is done there only because it prefers to pick
> > > the last sk from socks[] when bpf prog is not attached.
> > > This seems to get into the way of exploring other potential
> > > implementation options.
> > 
> > Yes.
> > This is just idea, but we can reserve the last index of socks[] to hold the
> > last 'moved' socket in reuseport_detach_sock() and use it in
> > inet_csk_listen_stop().
> > 
> > 
> > > Merging the discussion on the last socks[] pick from another thread:
> > > >
> > > > I think most applications start new listeners before closing listeners, 
> > > > in
> > > > this case, selecting the moved socket as the new listener works well.
> > > >
> > > >
> > > > > That said, if it is still desired to do a random pick by kernel when
> > > > > there is no bpf prog, it probably makes sense to guard it in a sysctl 
> > > > > as
> > > > > suggested in another reply.  To keep it simple, I would also keep this
> > > > > kernel-pick consistent instead of request socket is doing something
> > > > > different from the unhash path.
> > > >
> > > > Then, is this way better to keep kernel-pick consistent?
> > > >
> > > >   1. call reuseport_select_migrated_sock() without sk_hash from any path
> > > >   2. generate a random number in reuseport_select_migrated_sock()
> > > >   3. pass it to __reuseport_select_sock() only for select-by-hash
> > > >   (4. pass 0 as sk_hash to bpf_run_sk_reuseport not to use it)
> > > >   5. do migration per queue in inet_csk_listen_stop() or per request in
> > > >  receive path.
> > > >
> > > > I understand it is beautiful to keep consistency, but also think
> > > > the kernel-pick with heuristic performs better than random-pick.
> > > I think discussing the best kernel pick without explicit user input
> > > is going to be a dead end. There is always a case that
> > > makes this heuristic (or guess) fail.  e.g. what if multiple
> > > sk(s) being closed are always the last one in the socks[]?
> > > all their child sk(s) will then be piled up at one listen sk
> > > because the last socks[] is always picked?
> > 
> > There can be such a case, but it means the newly listened sockets are
> > closed earlier than old ones.
> > 
> > 
> > > Lets assume the last socks[] is indeed the best for all cases.  Then why
> > > the in-progress req don't pick it this way?  I feel the implementation
> > > is doing what is convenient at that point.  And that is fine, I think
> > 
> > In this patchset, I originally assumed four things:
> > 
> >   migration should be done
> > (i)   from old to new
> > (ii)  to redistribute requests evenly as possible
> > (iii) to keep the order of requests in the queue
> >   (resulting in splicing queues)
> > (iv)  in O(1) for scalability
> >   (resulting in fix-up rsk_listener approach)
> > 
> > I selected the last socket in unhash path to satisfy above four because the
> > last socket changes at every close() syscall if application closes from
> > older socket.
> > 
> > But in receiving ACK or retransmitting SYN+ACK, we cannot get the last
> > 'moved' socket. Even if we reserve the last 'moved' socket in the last
> > index by the idea above, we cannot sure the last socket is changed after
> > close() for each req->listener. For example, we have listeners A, B, C, and
> > D, and then call close(A) and close(B), and receive the final ACKs for A
> > and B, then both of them are assigned to C. In this case, A for D and B for
> > C is desired. So, selecting the last socket in socks[] for incoming
> > requests cannot realize (ii).
> > 
> > This is why I selected the last moved socket in unhash path and a random
> > listener in receive path.
> > 
> > 
> > > for kernel-pick, it should just go for simplicity and stay with
> > > the random(/hash) pick instead of pretending the kernel knows the
> > > application must operate in a certain way.  It is fine
> > > that the pick was wrong, the kernel will eventually move the
> > > childs/reqs to the survived listen sk.
> > 
> > Exactly. Also the heuristic way is not fair for every application.
> > 
> > After reading below idea (migrated_sk), I think random-pick is better
> > at simplicity and passing each sk.
> > 
> > 
> > > [ I still think the kernel should not even pick if
> > >   there is no bpf prog to instruct how to pick
> > >   but I am fine as long as there is a sysctl to
> > >   guard this. ]
> > 
> > Unless different applications listen on the same port, random-pick can save
> > connections which would 

Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.

2020-12-10 Thread Martin KaFai Lau
On Thu, Dec 10, 2020 at 02:15:38PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau 
> Date:   Wed, 9 Dec 2020 16:07:07 -0800
> > On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:
> > > This patch renames reuseport_select_sock() to __reuseport_select_sock() 
> > > and
> > > adds two wrapper function of it to pass the migration type defined in the
> > > previous commit.
> > > 
> > >   reuseport_select_sock  : BPF_SK_REUSEPORT_MIGRATE_NO
> > >   reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > > 
> > > As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
> > > requests at receiving the final ACK or sending a SYN+ACK. Therefore, this
> > > patch also changes the code to call reuseport_select_migrated_sock() even
> > > if the listening socket is TCP_CLOSE. If we can pick out a listening 
> > > socket
> > > from the reuseport group, we rewrite request_sock.rsk_listener and resume
> > > processing the request.
> > > 
> > > Reviewed-by: Benjamin Herrenschmidt 
> > > Signed-off-by: Kuniyuki Iwashima 
> > > ---
> > >  include/net/inet_connection_sock.h | 12 +++
> > >  include/net/request_sock.h | 13 
> > >  include/net/sock_reuseport.h   |  8 +++
> > >  net/core/sock_reuseport.c  | 34 --
> > >  net/ipv4/inet_connection_sock.c| 13 ++--
> > >  net/ipv4/tcp_ipv4.c|  9 ++--
> > >  net/ipv6/tcp_ipv6.c|  9 ++--
> > >  7 files changed, 81 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/include/net/inet_connection_sock.h 
> > > b/include/net/inet_connection_sock.h
> > > index 2ea2d743f8fc..1e0958f5eb21 100644
> > > --- a/include/net/inet_connection_sock.h
> > > +++ b/include/net/inet_connection_sock.h
> > > @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct 
> > > sock *sk)
> > >   reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
> > >  }
> > >  
> > > +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> > > +  struct sock *nsk,
> > > +  struct request_sock *req)
> > > +{
> > > + reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > > +  &inet_csk(nsk)->icsk_accept_queue,
> > > +  req);
> > > + sock_put(sk);
> > not sure if it is safe to do here.
> > IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
> > to req->rsk_listener such that sock_hold(req->rsk_listener) is
> > safe because its sk_refcnt is not zero.
> 
> I think it is safe to call sock_put() for the old listener here.
> 
> Without this patchset, at receiving the final ACK or retransmitting
> SYN+ACK, if sk_state == TCP_CLOSE, sock_put(req->rsk_listener) is done
> by calling reqsk_put() twice in inet_csk_reqsk_queue_drop_and_put().
Note that in your example (final ACK), sock_put(req->rsk_listener) is
_only_ called when reqsk_put() can get refcount_dec_and_test(&req->rsk_refcnt)
to reach zero.

Here in this patch, it does sock_put(req->rsk_listener) without req->rsk_refcnt
reaching zero.

Let's say there are two cores holding two refcnts to req (one cnt for each core)
by looking up the req from ehash.  One of the cores does this migrate and
sock_put(req->rsk_listener).  The other core does sock_hold(req->rsk_listener).

Core1                                   Core2
sock_put(req->rsk_listener)
                                        sock_hold(req->rsk_listener)

> And then, we do `goto lookup;` and overwrite the sk.
> 
> In the v2 patchset, refcount_inc_not_zero() is done for the new listener in
> reuseport_select_migrated_sock(), so we have to call sock_put() for the old
> listener instead to free it properly.
> 
> ---8<---
> +struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
> + struct sk_buff *skb)
> +{
> + struct sock *nsk;
> +
> + nsk = __reuseport_select_sock(sk, hash, skb, 0, 
> BPF_SK_REUSEPORT_MIGRATE_REQUEST);
> + if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))
There is another potential issue here.  The TCP_LISTEN nsk is protected
by rcu.  refcount_inc_not_zero(&nsk->sk_refcnt) cannot be done if it
is not under rcu_read_lock().

The receive path may be ok as it is in rcu.  You may need to check for
others.

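To make the rcu point concrete, here is a rough sketch of the pattern,
based on the quoted reuseport_select_migrated_sock() above (untested,
for illustration only):

---8<---
	struct sock *nsk;

	rcu_read_lock();
	nsk = __reuseport_select_sock(sk, hash, skb, 0,
				      BPF_SK_REUSEPORT_MIGRATE_REQUEST);
	/* nsk is not refcounted yet; it may only be probed with
	 * refcount_inc_not_zero() while the rcu read-side section
	 * guarantees it is not freed underneath us.
	 */
	if (nsk && !refcount_inc_not_zero(&nsk->sk_refcnt))
		nsk = NULL;
	rcu_read_unlock();
---8<---
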
> + return nsk;
> +
> + return NULL;
> +}
> +EXPORT_SYMBOL(reuseport_select_migrated_sock);
> ---8<---
> https://lore.kernel.org/netdev/20201207132456.65472-8-kun...@amazon.co.jp/
> 
> 
> > > + sock_hold(nsk);
> > > + req->rsk_listener = nsk;
It looks like there is another race here.  What
if multiple cores try to update req->rsk_listener?


Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

2020-12-10 Thread Martin KaFai Lau
On Thu, Dec 10, 2020 at 02:58:10PM +0900, Kuniyuki Iwashima wrote:

[ ... ]

> > > I've implemented one-by-one migration only for the accept queue for now.
> > > In addition to the concern about TFO queue,
> > You meant this queue:  queue->fastopenq.rskq_rst_head?
> 
> Yes.
> 
> 
> > Can "req" be passed?
> > I did not look up the lock/race in details for that though.
> 
> I think if we rewrite freeing TFO requests part like one of accept queue
> using reqsk_queue_remove(), we can also migrate them.
> 
> In this patchset, selecting a listener for accept queue, the TFO queue of
> the same listener is also migrated to another listener in order to prevent
> TFO spoofing attack.
> 
> If the request in the accept queue is migrated one by one, I am wondering
> which should the request in TFO queue be migrated to prevent attack or
> freed.
> 
> I think user need not know about keeping such requests in kernel to prevent
> attacks, so passing them to eBPF prog is confusing. But, redistributing
> them randomly without user's intention can make some irrelevant listeners
> unnecessarily drop new TFO requests, so this is also bad. Moreover, freeing
> such requests seems not so good in the point of security.
The current behavior (during process restart) also does not carry this
security queue.  Would not carrying them in this patch make it
less secure than the current behavior during process restart?
Do you need it now, or is it something that can be considered later
without changing uapi bpf.h?

> > > ---8<---
> > > diff --git a/net/ipv4/inet_connection_sock.c 
> > > b/net/ipv4/inet_connection_sock.c
> > > index a82fd4c912be..d0ddd3cb988b 100644
> > > --- a/net/ipv4/inet_connection_sock.c
> > > +++ b/net/ipv4/inet_connection_sock.c
> > > @@ -1001,6 +1001,29 @@ struct sock *inet_csk_reqsk_queue_add(struct sock 
> > > *sk,
> > >  }
> > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > >  
> > > +static bool inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock 
> > > *nsk, struct request_sock *req)
> > > +{
> > > +   struct request_sock_queue *queue = 
> > > &inet_csk(nsk)->icsk_accept_queue;
> > > +   bool migrated = false;
> > > +
> > > +   spin_lock(&queue->rskq_lock);
> > > +   if (likely(nsk->sk_state == TCP_LISTEN)) {
> > > +   migrated = true;
> > > +
> > > +   req->dl_next = NULL;
> > > +   if (queue->rskq_accept_head == NULL)
> > > +   WRITE_ONCE(queue->rskq_accept_head, req);
> > > +   else
> > > +   queue->rskq_accept_tail->dl_next = req;
> > > +   queue->rskq_accept_tail = req;
> > > +   sk_acceptq_added(nsk);
> > > +   inet_csk_reqsk_queue_migrated(sk, nsk, req);
> > need to first resolve the question raised in patch 5 regarding
> > to the update on req->rsk_listener though.
> 
> In the unhash path, it is also safe to call sock_put() for the old listner.
> 
> In inet_csk_listen_stop(), the sk_refcnt of the listener >= 1. If the
> listener does not have immature requests, sk_refcnt is 1 and freed in
> __tcp_close().
> 
>   sock_hold(sk) in __tcp_close()
>   sock_put(sk) in inet_csk_destroy_sock()
>   sock_put(sk) in __tcp_clsoe()
I don't see how it is different here than in patch 5.
I could be missing something.

Let's continue the discussion on the other thread (patch 5) first.

> 
> 
> > > +   }
> > > +   spin_unlock(&queue->rskq_lock);
> > > +
> > > +   return migrated;
> > > +}
> > > +
> > >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock 
> > > *child,
> > >  struct request_sock *req, bool 
> > > own_req)
> > >  {
> > > @@ -1023,9 +1046,11 @@ EXPORT_SYMBOL(inet_csk_complete_hashdance);
> > >   */
> > >  void inet_csk_listen_stop(struct sock *sk)
> > >  {
> > > +   struct sock_reuseport *reuseport_cb = 
> > > rcu_access_pointer(sk->sk_reuseport_cb);
> > > struct inet_connection_sock *icsk = inet_csk(sk);
> > > struct request_sock_queue *queue = &icsk->icsk_accept_queue;
> > > struct request_sock *next, *req;
> > > +   struct sock *nsk;
> > >  
> > > /* Following specs, it would be better either to send FIN
> > >  * (and enter FIN-WAIT-1, it is normal close)
> > > @@ -1043,8 +1068,19 @@ void inet_csk_listen_stop(struct sock *sk)
> > > WARN_ON(sock_owned_by_user(child));
> > > sock_hold(child);
> > >  
> > > +   if (reuseport_cb) {
> > > +   nsk = reuseport_select_migrated_sock(sk, 
> > > req_to_sk(req)->sk_hash, NULL);
> > > +   if (nsk) {
> > > +   if (inet_csk_reqsk_queue_migrate(sk, nsk, 
> > > req))
> > > +   goto unlock_sock;
> > > +   else
> > > +   sock_put(nsk);
> > > +   }
> > > +   }
> > > +
> > >  

Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.

2020-12-14 Thread Martin KaFai Lau
On Tue, Dec 15, 2020 at 02:03:13AM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau 
> Date:   Thu, 10 Dec 2020 10:49:15 -0800
> > On Thu, Dec 10, 2020 at 02:15:38PM +0900, Kuniyuki Iwashima wrote:
> > > From:   Martin KaFai Lau 
> > > Date:   Wed, 9 Dec 2020 16:07:07 -0800
> > > > On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:
> > > > > This patch renames reuseport_select_sock() to 
> > > > > __reuseport_select_sock() and
> > > > > adds two wrapper function of it to pass the migration type defined in 
> > > > > the
> > > > > previous commit.
> > > > > 
> > > > >   reuseport_select_sock  : BPF_SK_REUSEPORT_MIGRATE_NO
> > > > >   reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > > > > 
> > > > > As mentioned before, we have to select a new listener for 
> > > > > TCP_NEW_SYN_RECV
> > > > > requests at receiving the final ACK or sending a SYN+ACK. Therefore, 
> > > > > this
> > > > > patch also changes the code to call reuseport_select_migrated_sock() 
> > > > > even
> > > > > if the listening socket is TCP_CLOSE. If we can pick out a listening 
> > > > > socket
> > > > > from the reuseport group, we rewrite request_sock.rsk_listener and 
> > > > > resume
> > > > > processing the request.
> > > > > 
> > > > > Reviewed-by: Benjamin Herrenschmidt 
> > > > > Signed-off-by: Kuniyuki Iwashima 
> > > > > ---
> > > > >  include/net/inet_connection_sock.h | 12 +++
> > > > >  include/net/request_sock.h | 13 
> > > > >  include/net/sock_reuseport.h   |  8 +++
> > > > >  net/core/sock_reuseport.c  | 34 
> > > > > --
> > > > >  net/ipv4/inet_connection_sock.c| 13 ++--
> > > > >  net/ipv4/tcp_ipv4.c|  9 ++--
> > > > >  net/ipv6/tcp_ipv6.c|  9 ++--
> > > > >  7 files changed, 81 insertions(+), 17 deletions(-)
> > > > > 
> > > > > diff --git a/include/net/inet_connection_sock.h 
> > > > > b/include/net/inet_connection_sock.h
> > > > > index 2ea2d743f8fc..1e0958f5eb21 100644
> > > > > --- a/include/net/inet_connection_sock.h
> > > > > +++ b/include/net/inet_connection_sock.h
> > > > > @@ -272,6 +272,18 @@ static inline void 
> > > > > inet_csk_reqsk_queue_added(struct sock *sk)
> > > > >   reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
> > > > >  }
> > > > >  
> > > > > +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> > > > > +  struct sock *nsk,
> > > > > +  struct request_sock 
> > > > > *req)
> > > > > +{
> > > > > + reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > > > > +  &inet_csk(nsk)->icsk_accept_queue,
> > > > > +  req);
> > > > > + sock_put(sk);
> > > > not sure if it is safe to do here.
> > > > IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
> > > > to req->rsk_listener such that sock_hold(req->rsk_listener) is
> > > > safe because its sk_refcnt is not zero.
> > > 
> > > I think it is safe to call sock_put() for the old listener here.
> > > 
> > > Without this patchset, at receiving the final ACK or retransmitting
> > > SYN+ACK, if sk_state == TCP_CLOSE, sock_put(req->rsk_listener) is done
> > > by calling reqsk_put() twice in inet_csk_reqsk_queue_drop_and_put().
> > Note that in your example (final ACK), sock_put(req->rsk_listener) is
> > _only_ called when reqsk_put() can get 
> > refcount_dec_and_test(&req->rsk_refcnt)
> > to reach zero.
> > 
> > Here in this patch, it sock_put(req->rsk_listener) without req->rsk_refcnt
> > reaching zero.
> > 
> > Let says there are two cores holding two refcnt to req (one cnt for each 
> > core)
> > by looking up the req from ehash.  One of the core do this migrate and
> > sock_put(req->rsk_listener).  Another core does 
> > sock_hold(req->rsk_lis

Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.

2020-12-16 Thread Martin KaFai Lau
On Thu, Dec 17, 2020 at 01:41:58AM +0900, Kuniyuki Iwashima wrote:
[ ... ]

> > There may also be places assuming that the req->rsk_listener will never
> > change once it is assigned.  not sure.  have not looked closely yet.
> 
> I have checked this again. There are no functions that expect explicitly
> req->rsk_listener never change except for BUG_ON in inet_child_forget().
> No BUG_ON/WARN_ON does not mean they does not assume listener never
> change, but such functions still work properly if rsk_listener is changed.
The migration not only changes the ptr value of req->rsk_listener, it also
means the req is moved to another listener (e.g. by updating the qlen of
the old sk and new sk).

Let's reuse the example about two cores at the TCP_NEW_SYN_RECV path
racing to finish up the 3WHS.

One core is already at inet_csk_complete_hashdance() doing
"reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req)".
What happens if another core migrates the req to another listener?
Would the "reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req)"
then be operating on an accept_queue that this req no longer belongs to?

Also, from a quick look at how reqsk_timer_handler() updates
queue->young and req->num_timeout, I am not sure
reqsk_queue_migrated() will work either:

+static inline void reqsk_queue_migrated(struct request_sock_queue *old_accept_queue,
+                                       struct request_sock_queue *new_accept_queue,
+                                       const struct request_sock *req)
+{
+   atomic_dec(&old_accept_queue->qlen);
+   atomic_inc(&new_accept_queue->qlen);
+
+   if (req->num_timeout == 0) {
What if reqsk_timer_handler() is running in parallel
and updating req->num_timeout?

+   atomic_dec(&old_accept_queue->young);
+   atomic_inc(&new_accept_queue->young);
+   }
+}


It feels like some of the "own_req" related logic may be useful here.
Not sure; it could be something worth thinking about.

> 
> 
> > It probably needs some more thoughts here to get a simpler solution.
> 
> Is it fine to move sock_hold() before assigning rsk_listener and defer
> sock_put() to the end of tcp_v[46]_rcv() ?
I don't see how this ordering helps, considering the migration can happen
at any time on another core.

> 
> Also, we have to rewrite rsk_listener first and then call sock_put() in
> reqsk_timer_handler() so that rsk_listener always has refcount more than 1.
> 
> ---8<---
>   struct sock *nsk, *osk;
>   bool migrated = false;
>   ...
>   sock_hold(req->rsk_listener);  // (i)
>   sk = req->rsk_listener;
>   ...
>   if (sk->sk_state == TCP_CLOSE) {
>   osk = sk;
>   // do migration without sock_put()
>   sock_hold(nsk);  // (ii) (as with (i))
>   sk = nsk;
>   migrated = true;
>   }
>   ...
>   if (migrated) {
>   sock_put(sk);  // pair with (ii)
>   sock_put(osk); // decrement old listener's refcount
>   sk = osk;
>   }
>   sock_put(sk);  // pair with (i)
> ---8<---


Re: [PATCH bpf-next 1/2] bpf: try to avoid kzalloc in cgroup/{s,g}etsockopt

2020-12-22 Thread Martin KaFai Lau
On Thu, Dec 17, 2020 at 09:23:23AM -0800, Stanislav Fomichev wrote:
> When we attach a bpf program to cgroup/getsockopt any other getsockopt()
> syscall starts incurring kzalloc/kfree cost. While, in general, it's
> not an issue, sometimes it is, like in the case of TCP_ZEROCOPY_RECEIVE.
> TCP_ZEROCOPY_RECEIVE (ab)uses getsockopt system call to implement
> fastpath for incoming TCP, we don't want to have extra allocations in
> there.
> 
> Let add a small buffer on the stack and use it for small (majority)
> {s,g}etsockopt values. I've started with 128 bytes to cover
> the options we care about (TCP_ZEROCOPY_RECEIVE which is 32 bytes
> currently, with some planned extension to 64 + some headroom
> for the future).
> 
> It seems natural to do the same for setsockopt, but it's a bit more
> involved when the BPF program modifies the data (where we have to
> kmalloc). The assumption is that for the majority of setsockopt
> calls (which are doing pure BPF options or apply policy) this
> will bring some benefit as well.
> 
> Signed-off-by: Stanislav Fomichev 
> ---
>  include/linux/filter.h |  3 +++
>  kernel/bpf/cgroup.c| 41 +++--
>  2 files changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 29c27656165b..362eb0d7af5d 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1281,6 +1281,8 @@ struct bpf_sysctl_kern {
>   u64 tmp_reg;
>  };
>  
> +#define BPF_SOCKOPT_KERN_BUF_SIZE128
Since these 128 bytes (which then need to be zeroed) are modeled after
the TCP_ZEROCOPY_RECEIVE use case, it would be useful to explain
a use case for how the bpf prog will interact with
getsockopt(TCP_ZEROCOPY_RECEIVE).
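
For example, this is roughly the per-call fast path that goes through the
cgroup/getsockopt hook (a minimal userspace sketch, not from the patch; the
mmap()'d rx area setup and error handling are omitted, and zc_receive() is a
made-up name):

---8<---
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>	/* TCP_ZEROCOPY_RECEIVE, struct tcp_zerocopy_receive */

static long zc_receive(int fd, void *rx_area, unsigned int chunk)
{
	struct tcp_zerocopy_receive zc = {
		.address = (__u64)(unsigned long)rx_area,	/* mmap()'d region */
		.length  = chunk,
	};
	socklen_t zc_len = sizeof(zc);

	/* With a cgroup/getsockopt bpf prog attached, every one of these
	 * calls pays the kzalloc/kfree cost for the optval copy today.
	 */
	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len))
		return -1;

	return zc.length;	/* bytes now mapped at rx_area */
}
---8<---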


Re: [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.

2020-12-02 Thread Martin KaFai Lau
On Tue, Dec 01, 2020 at 06:04:50PM -0800, Andrii Nakryiko wrote:
> On Tue, Dec 1, 2020 at 6:49 AM Kuniyuki Iwashima  wrote:
> >
> > This commit adds new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to
> > check if the attached eBPF program is capable of migrating sockets.
> >
> > When the eBPF program is attached, the kernel runs it for socket migration
> > only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> > The kernel will change the behaviour depending on the returned value:
> >
> >   - SK_PASS with selected_sk, select it as a new listener
> >   - SK_PASS with selected_sk NULL, fall back to the random selection
> >   - SK_DROP, cancel the migration
> >
> > Link: 
> > https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6t...@kafai-mbp.dhcp.thefacebook.com/
> > Suggested-by: Martin KaFai Lau 
> > Signed-off-by: Kuniyuki Iwashima 
> > ---
> >  include/uapi/linux/bpf.h   | 2 ++
> >  kernel/bpf/syscall.c   | 8 
> >  tools/include/uapi/linux/bpf.h | 2 ++
> >  3 files changed, 12 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 85278deff439..cfc207ae7782 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -241,6 +241,8 @@ enum bpf_attach_type {
> > BPF_XDP_CPUMAP,
> > BPF_SK_LOOKUP,
> > BPF_XDP,
> > +   BPF_SK_REUSEPORT_SELECT,
> > +   BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index f3fe9f53f93c..a0796a8de5ea 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2036,6 +2036,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type 
> > prog_type,
> > if (expected_attach_type == BPF_SK_LOOKUP)
> > return 0;
> > return -EINVAL;
> > +   case BPF_PROG_TYPE_SK_REUSEPORT:
> > +   switch (expected_attach_type) {
> > +   case BPF_SK_REUSEPORT_SELECT:
> > +   case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE:
> > +   return 0;
> > +   default:
> > +   return -EINVAL;
> > +   }
> 
> this is a kernel regression, previously expected_attach_type wasn't
> enforced, so user-space could have provided any number without an
> error.
I also think this change alone will break things like the usual
attr->expected_attach_type == 0 case.  At least a change is needed in
bpf_prog_load_fixup_attach_type(), which is also handling a
similar situation for BPF_PROG_TYPE_CGROUP_SOCK.

I now think there is no need to expose a new bpf_attach_type to the UAPI.
Since prog->expected_attach_type is not used, it can be cleared at load time
and then only set to BPF_SK_REUSEPORT_SELECT_OR_MIGRATE (probably defined
internally in filter.[c|h]) in is_valid_access() when "migration"
is accessed.  When "migration" is accessed, the bpf prog can handle both
the migration case and the original non-migration case.

> 
> > case BPF_PROG_TYPE_EXT:
> > if (expected_attach_type)
> > return -EINVAL;
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 85278deff439..cfc207ae7782 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -241,6 +241,8 @@ enum bpf_attach_type {
> > BPF_XDP_CPUMAP,
> > BPF_SK_LOOKUP,
> > BPF_XDP,
> > +   BPF_SK_REUSEPORT_SELECT,
> > +   BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > --
> > 2.17.2 (Apple Git-113)
> >


Re: [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.

2020-12-02 Thread Martin KaFai Lau
On Wed, Dec 02, 2020 at 11:19:02AM -0800, Martin KaFai Lau wrote:
> On Tue, Dec 01, 2020 at 06:04:50PM -0800, Andrii Nakryiko wrote:
> > On Tue, Dec 1, 2020 at 6:49 AM Kuniyuki Iwashima  
> > wrote:
> > >
> > > This commit adds new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to
> > > check if the attached eBPF program is capable of migrating sockets.
> > >
> > > When the eBPF program is attached, the kernel runs it for socket migration
> > > only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> > > The kernel will change the behaviour depending on the returned value:
> > >
> > >   - SK_PASS with selected_sk, select it as a new listener
> > >   - SK_PASS with selected_sk NULL, fall back to the random selection
> > >   - SK_DROP, cancel the migration
> > >
> > > Link: 
> > > https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6t...@kafai-mbp.dhcp.thefacebook.com/
> > > Suggested-by: Martin KaFai Lau 
> > > Signed-off-by: Kuniyuki Iwashima 
> > > ---
> > >  include/uapi/linux/bpf.h   | 2 ++
> > >  kernel/bpf/syscall.c   | 8 
> > >  tools/include/uapi/linux/bpf.h | 2 ++
> > >  3 files changed, 12 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 85278deff439..cfc207ae7782 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -241,6 +241,8 @@ enum bpf_attach_type {
> > > BPF_XDP_CPUMAP,
> > > BPF_SK_LOOKUP,
> > > BPF_XDP,
> > > +   BPF_SK_REUSEPORT_SELECT,
> > > +   BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > > __MAX_BPF_ATTACH_TYPE
> > >  };
> > >
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index f3fe9f53f93c..a0796a8de5ea 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -2036,6 +2036,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type 
> > > prog_type,
> > > if (expected_attach_type == BPF_SK_LOOKUP)
> > > return 0;
> > > return -EINVAL;
> > > +   case BPF_PROG_TYPE_SK_REUSEPORT:
> > > +   switch (expected_attach_type) {
> > > +   case BPF_SK_REUSEPORT_SELECT:
> > > +   case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE:
> > > +   return 0;
> > > +   default:
> > > +   return -EINVAL;
> > > +   }
> > 
> > this is a kernel regression, previously expected_attach_type wasn't
> > enforced, so user-space could have provided any number without an
> > error.
> I also think this change alone will break things like when the usual
> attr->expected_attach_type == 0 case.  At least changes is needed in
> bpf_prog_load_fixup_attach_type() which is also handling a
> similar situation for BPF_PROG_TYPE_CGROUP_SOCK.
> 
> I now think there is no need to expose new bpf_attach_type to the UAPI.
> Since the prog->expected_attach_type is not used, it can be cleared at load 
> time
> and then only set to BPF_SK_REUSEPORT_SELECT_OR_MIGRATE (probably defined
> internally at filter.[c|h]) in the is_valid_access() when "migration"
> is accessed.  When "migration" is accessed, the bpf prog can handle
> migration (and the original not-migration) case.
Scrap this internal-only BPF_SK_REUSEPORT_SELECT_OR_MIGRATE idea.
I think there will be cases where a bpf prog wants to handle both
without accessing any field from sk_reuseport_md.

Let's go back to the discussion on using a similar idea as
BPF_PROG_TYPE_CGROUP_SOCK in bpf_prog_load_fixup_attach_type().
I am not aware of any loader setting a random number in
expected_attach_type, so the chance of breaking is very low.
There was a similar discussion earlier [0].

[0]: https://lore.kernel.org/netdev/20200126045443.f47dzxdglazzchfm@ast-mbp/
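
To make the suggestion concrete, below is a minimal sketch of what the
fixup could look like.  The existing BPF_PROG_TYPE_CGROUP_SOCK case is
paraphrased from memory, and the exact defaulting is up to the patch:

---8<---
static void bpf_prog_load_fixup_attach_type(union bpf_attr *attr)
{
	switch (attr->prog_type) {
	case BPF_PROG_TYPE_CGROUP_SOCK:
		if (!attr->expected_attach_type)
			attr->expected_attach_type =
				BPF_CGROUP_INET_SOCK_CREATE;
		break;
	case BPF_PROG_TYPE_SK_REUSEPORT:
		/* Old loaders pass 0.  Treat it as a plain SELECT so the
		 * later check in bpf_prog_load_check_attach() does not
		 * start rejecting them.
		 */
		if (!attr->expected_attach_type)
			attr->expected_attach_type =
				BPF_SK_REUSEPORT_SELECT;
		break;
	}
}
---8<---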

> 
> > 
> > > case BPF_PROG_TYPE_EXT:
> > > if (expected_attach_type)
> > > return -EINVAL;
> > > diff --git a/tools/include/uapi/linux/bpf.h 
> > > b/tools/include/uapi/linux/bpf.h
> > > index 85278deff439..cfc207ae7782 100644
> > > --- a/tools/include/uapi/linux/bpf.h
> > > +++ b/tools/include/uapi/linux/bpf.h
> > > @@ -241,6 +241,8 @@ enum bpf_attach_type {
> > > BPF_XDP_CPUMAP,
> > > BPF_SK_LOOKUP,
> > > BPF_XDP,
> > > +   BPF_SK_REUSEPORT_SELECT,
> > > +   BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > > __MAX_BPF_ATTACH_TYPE
> > >  };
> > >
> > > --
> > > 2.17.2 (Apple Git-113)
> > >


Re: [PATCH v4 bpf-next 0/2] Add support to set window_clamp from bpf setsockops

2020-12-03 Thread Martin KaFai Lau
On Wed, Dec 02, 2020 at 01:31:50PM -0800, Prankur gupta wrote:
> This patch contains support to set the tcp window_clamp field from bpf setsockops.
> 
> v2: Used TCP_WINDOW_CLAMP setsockopt logic for bpf_setsockopt (review comment 
> addressed)
> 
> v3: Created a common function for duplicated code (review comment addressed)
> 
> v4: Removing logic to pass struct sock and struct tcp_sock together (review 
> comment addressed)
nit: please keep lines short even in the cover letter.

Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next v4 5/6] bpf: Add an iterator selftest for bpf_sk_storage_get

2020-12-03 Thread Martin KaFai Lau
On Wed, Dec 02, 2020 at 09:55:26PM +0100, Florent Revest wrote:
> The eBPF program iterates over all files and tasks. For all socket
> files, it stores the tgid of the last task it encountered with a handle
> to that socket. This is a heuristic for finding the "owner" of a socket
> similar to what's done by lsof, ss, netstat or fuser. Potentially, this
> information could be used from a cgroup_skb/*gress hook to try to
> associate network traffic with processes.
> 
> The test makes sure that a socket it created is tagged with prog_tests's
> pid.
Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next v4 6/6] bpf: Test bpf_sk_storage_get in tcp iterators

2020-12-03 Thread Martin KaFai Lau
On Wed, Dec 02, 2020 at 09:55:27PM +0100, Florent Revest wrote:
> This extends the existing bpf_sk_storage_get test where a socket is
> created and tagged with its creator's pid by a task_file iterator.
> 
> A TCP iterator is now also used at the end of the test to negate the
> values already stored in the local storage. The test therefore expects
> -getpid() to be stored in the local storage.
> 
> Signed-off-by: Florent Revest 
> Acked-by: Yonghong Song 
> ---
>  .../selftests/bpf/prog_tests/bpf_iter.c| 13 +
>  .../progs/bpf_iter_bpf_sk_storage_helpers.c| 18 ++
>  2 files changed, 31 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c 
> b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> index 9336d0f18331..b8362147c9e3 100644
> --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> @@ -978,6 +978,8 @@ static void test_bpf_sk_storage_delete(void)
>  /* This creates a socket and its local storage. It then runs a task_iter BPF
>   * program that replaces the existing socket local storage with the tgid of 
> the
>   * only task owning a file descriptor to this socket, this process, 
> prog_tests.
> + * It then runs a tcp socket iterator that negates the value in the existing
> + * socket local storage, the test verifies that the resulting value is -pid.
>   */
>  static void test_bpf_sk_storage_get(void)
>  {
> @@ -994,6 +996,10 @@ static void test_bpf_sk_storage_get(void)
>   if (CHECK(sock_fd < 0, "socket", "errno: %d\n", errno))
>   goto out;
>  
> + err = listen(sock_fd, 1);
> + if (CHECK(err != 0, "listen", "errno: %d\n", errno))
> + goto out;

goto close_socket;

> +
>   map_fd = bpf_map__fd(skel->maps.sk_stg_map);
>  
>   err = bpf_map_update_elem(map_fd, &sock_fd, &val, BPF_NOEXIST);
> @@ -1007,6 +1013,13 @@ static void test_bpf_sk_storage_get(void)
> "map value wasn't set correctly (expected %d, got %d, err=%d)\n",
> getpid(), val, err);
The failure of this CHECK here should "goto close_socket;" now.
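
i.e. roughly as below.  The lookup and CHECK lines are reconstructed
from the quoted context; only wrapping the CHECK in an if and the goto
are the suggested change:

---8<---
	err = bpf_map_lookup_elem(map_fd, &sock_fd, &val);
	if (CHECK(err || val != getpid(), "bpf_map_lookup_elem",
		  "map value wasn't set correctly (expected %d, got %d, err=%d)\n",
		  getpid(), val, err))
		goto close_socket;
---8<---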

Others LGTM.

Acked-by: Martin KaFai Lau 

>  
> + do_dummy_read(skel->progs.negate_socket_local_storage);
> +
> + err = bpf_map_lookup_elem(map_fd, &sock_fd, &val);
> + CHECK(err || val != -getpid(), "bpf_map_lookup_elem",
> +   "map value wasn't set correctly (expected %d, got %d, err=%d)\n",
> +   -getpid(), val, err);
> +
>  close_socket:
>   close(sock_fd);
>  out:


Re: [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.

2020-12-03 Thread Martin KaFai Lau
On Thu, Dec 03, 2020 at 11:16:08PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau 
> Date:   Wed, 2 Dec 2020 20:24:02 -0800
> > On Wed, Dec 02, 2020 at 11:19:02AM -0800, Martin KaFai Lau wrote:
> > > On Tue, Dec 01, 2020 at 06:04:50PM -0800, Andrii Nakryiko wrote:
> > > > On Tue, Dec 1, 2020 at 6:49 AM Kuniyuki Iwashima  
> > > > wrote:
> > > > >
> > > > > This commit adds new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to
> > > > > check if the attached eBPF program is capable of migrating sockets.
> > > > >
> > > > > When the eBPF program is attached, the kernel runs it for socket 
> > > > > migration
> > > > > only if the expected_attach_type is 
> > > > > BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> > > > > The kernel will change the behaviour depending on the returned value:
> > > > >
> > > > >   - SK_PASS with selected_sk, select it as a new listener
> > > > >   - SK_PASS with selected_sk NULL, fall back to the random selection
> > > > >   - SK_DROP, cancel the migration
> > > > >
> > > > > Link: 
> > > > > https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6t...@kafai-mbp.dhcp.thefacebook.com/
> > > > > Suggested-by: Martin KaFai Lau 
> > > > > Signed-off-by: Kuniyuki Iwashima 
> > > > > ---
> > > > >  include/uapi/linux/bpf.h   | 2 ++
> > > > >  kernel/bpf/syscall.c   | 8 
> > > > >  tools/include/uapi/linux/bpf.h | 2 ++
> > > > >  3 files changed, 12 insertions(+)
> > > > >
> > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > > index 85278deff439..cfc207ae7782 100644
> > > > > --- a/include/uapi/linux/bpf.h
> > > > > +++ b/include/uapi/linux/bpf.h
> > > > > @@ -241,6 +241,8 @@ enum bpf_attach_type {
> > > > > BPF_XDP_CPUMAP,
> > > > > BPF_SK_LOOKUP,
> > > > > BPF_XDP,
> > > > > +   BPF_SK_REUSEPORT_SELECT,
> > > > > +   BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > > > > __MAX_BPF_ATTACH_TYPE
> > > > >  };
> > > > >
> > > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > > index f3fe9f53f93c..a0796a8de5ea 100644
> > > > > --- a/kernel/bpf/syscall.c
> > > > > +++ b/kernel/bpf/syscall.c
> > > > > @@ -2036,6 +2036,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type 
> > > > > prog_type,
> > > > > if (expected_attach_type == BPF_SK_LOOKUP)
> > > > > return 0;
> > > > > return -EINVAL;
> > > > > +   case BPF_PROG_TYPE_SK_REUSEPORT:
> > > > > +   switch (expected_attach_type) {
> > > > > +   case BPF_SK_REUSEPORT_SELECT:
> > > > > +   case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE:
> > > > > +   return 0;
> > > > > +   default:
> > > > > +   return -EINVAL;
> > > > > +   }
> > > > 
> > > > this is a kernel regression, previously expected_attach_type wasn't
> > > > enforced, so user-space could have provided any number without an
> > > > error.
> > > I also think this change alone will break things like when the usual
> > > attr->expected_attach_type == 0 case.  At least changes is needed in
> > > bpf_prog_load_fixup_attach_type() which is also handling a
> > > similar situation for BPF_PROG_TYPE_CGROUP_SOCK.
> > > 
> > > I now think there is no need to expose new bpf_attach_type to the UAPI.
> > > Since the prog->expected_attach_type is not used, it can be cleared at 
> > > load time
> > > and then only set to BPF_SK_REUSEPORT_SELECT_OR_MIGRATE (probably defined
> > > internally at filter.[c|h]) in the is_valid_access() when "migration"
> > > is accessed.  When "migration" is accessed, the bpf prog can handle
> > > migration (and the original not-migration) case.
> > Scrap this internal only BPF_SK_REUSEPORT_SELECT_OR_MIGRATE idea.
> > I think there will be cases that bpf prog wants to do both
> > without accessing any field from sk_reuseport_md.
> > 
> > Lets go back to the discussion on using a similar
> > idea as BPF_PROG_TYPE_CGROUP_SOCK in bpf_prog_load_fixup_attach_type().
> > I am not aware there is loader setting a random number
> > in expected_attach_type, so the chance of breaking
> > is very low.  There was a similar discussion earlier [0].
> > 
> > [0]: https://lore.kernel.org/netdev/20200126045443.f47dzxdglazzchfm@ast-mbp/
> 
> Thank you for the idea and reference.
> 
> I will remove the change in bpf_prog_load_check_attach() and set the
> default value (BPF_SK_REUSEPORT_SELECT) in bpf_prog_load_fixup_attach_type()
> for backward compatibility if expected_attach_type is 0.
The check in bpf_prog_load_check_attach() can be kept.  You can refer to
commit aac3fc320d94 for a similar situation.


Re: [PATCH v1 bpf-next 09/11] bpf: Support bpf_get_socket_cookie_sock() for BPF_PROG_TYPE_SK_REUSEPORT.

2020-12-04 Thread Martin KaFai Lau
On Tue, Dec 01, 2020 at 11:44:16PM +0900, Kuniyuki Iwashima wrote:
> We will call sock_reuseport.prog for socket migration in the next commit,
> so the eBPF program has to know which listener is closing in order to
> select the new listener.
> 
> Currently, we can get a unique ID for each listener in the userspace by
> calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.
> 
> This patch makes the sk pointer available in sk_reuseport_md so that we can
> get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program.
> 
> Link: 
> https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f...@kafai-mbp.dhcp.thefacebook.com/
> Suggested-by: Martin KaFai Lau 
> Signed-off-by: Kuniyuki Iwashima 
> ---
>  include/uapi/linux/bpf.h   |  8 
>  net/core/filter.c  | 12 +++-
>  tools/include/uapi/linux/bpf.h |  8 
>  3 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index efe342bf3dbc..3e9b8bd42b4e 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1650,6 +1650,13 @@ union bpf_attr {
>   *   A 8-byte long non-decreasing number on success, or 0 if the
>   *   socket field is missing inside *skb*.
>   *
> + * u64 bpf_get_socket_cookie(struct bpf_sock *sk)
> + *   Description
> + *   Equivalent to bpf_get_socket_cookie() helper that accepts
> + *   *skb*, but gets socket from **struct bpf_sock** context.
> + *   Return
> + *   A 8-byte long non-decreasing number.
> + *
>   * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx)
>   *   Description
>   *   Equivalent to bpf_get_socket_cookie() helper that accepts
> @@ -4420,6 +4427,7 @@ struct sk_reuseport_md {
>   __u32 bind_inany;   /* Is sock bound to an INANY address? */
>   __u32 hash; /* A hash of the packet 4 tuples */
>   __u8 migration; /* Migration type */
> + __bpf_md_ptr(struct bpf_sock *, sk); /* current listening socket */
>  };
>  
>  #define BPF_TAG_SIZE 8
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 0a0634787bb4..1059d31847ef 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4628,7 +4628,7 @@ static const struct bpf_func_proto 
> bpf_get_socket_cookie_sock_proto = {
>   .func   = bpf_get_socket_cookie_sock,
>   .gpl_only   = false,
>   .ret_type   = RET_INTEGER,
> - .arg1_type  = ARG_PTR_TO_CTX,
> + .arg1_type  = ARG_PTR_TO_SOCKET,
This will break existing bpf progs (BPF_PROG_TYPE_CGROUP_SOCK) using
this proto.  A new proto is needed, and there is an on-going patch
doing this [0].

[0]: https://lore.kernel.org/bpf/20201203213330.1657666-1-rev...@google.com/
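
For illustration, a separate proto could look like the sketch below,
leaving the existing ARG_PTR_TO_CTX proto untouched for
BPF_PROG_TYPE_CGROUP_SOCK.  The proto name is made up here, and the
on-going patch in [0] may settle on something different:

---8<---
static const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto = {
	.func		= bpf_get_socket_cookie_sock,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_PTR_TO_SOCKET,
};
---8<---

sk_reuseport_func_proto() would then return this new proto for
BPF_FUNC_get_socket_cookie instead of changing the shared one.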


Re: [PATCH v1 bpf-next 01/11] tcp: Keep TCP_CLOSE sockets in the reuseport group.

2020-12-04 Thread Martin KaFai Lau
On Tue, Dec 01, 2020 at 11:44:08PM +0900, Kuniyuki Iwashima wrote:
> This patch is a preparation patch to migrate incoming connections in the
> later commits and adds a field (num_closed_socks) to the struct
> sock_reuseport to keep TCP_CLOSE sockets in the reuseport group.
> 
> When we close a listening socket, to migrate its connections to another
> listener in the same reuseport group, we have to handle two kinds of child
> sockets. One is that a listening socket has a reference to, and the other
> is not.
> 
> The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the
> accept queue of their listening socket. So, we can pop them out and push
> them into another listener's queue at close() or shutdown() syscalls. On
> the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the
> three-way handshake and not in the accept queue. Thus, we cannot access
> such sockets at close() or shutdown() syscalls. Accordingly, we have to
> migrate immature sockets after their listening socket has been closed.
> 
> Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
> sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
> that time, if we could select a new listener from the same reuseport group,
> no connection would be aborted. However, it is impossible because
> reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to
> the reuseport group from closed sockets.
> 
> This patch allows TCP_CLOSE sockets to remain in the reuseport group and to
> have access to it while any child socket references to them. The point is
> that reuseport_detach_sock() is called twice from inet_unhash() and
> sk_destruct(). At first, it moves the socket backwards in socks[] and
> increments num_closed_socks. Later, when all migrated connections are
> accepted, it removes the socket from socks[], decrements num_closed_socks,
> and sets NULL to sk_reuseport_cb.
> 
> By this change, closed sockets can keep sk_reuseport_cb until all child
> requests have been freed or accepted. Consequently calling listen() after
> shutdown() can cause EADDRINUSE or EBUSY in reuseport_add_sock() or
> inet_csk_bind_conflict() which expect that such sockets should not have the
> reuseport group. Therefore, this patch also loosens such validation rules
> so that the socket can listen again if it has the same reuseport group with
> other listening sockets.
> 
> Reviewed-by: Benjamin Herrenschmidt 
> Signed-off-by: Kuniyuki Iwashima 
> ---
>  include/net/sock_reuseport.h|  5 ++-
>  net/core/sock_reuseport.c   | 79 +++--
>  net/ipv4/inet_connection_sock.c |  7 ++-
>  3 files changed, 74 insertions(+), 17 deletions(-)
> 
> diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> index 505f1e18e9bf..0e558ca7afbf 100644
> --- a/include/net/sock_reuseport.h
> +++ b/include/net/sock_reuseport.h
> @@ -13,8 +13,9 @@ extern spinlock_t reuseport_lock;
>  struct sock_reuseport {
>   struct rcu_head rcu;
>  
> - u16 max_socks;  /* length of socks */
> - u16 num_socks;  /* elements in socks */
> + u16 max_socks;  /* length of socks */
> + u16 num_socks;  /* elements in socks */
> + u16 num_closed_socks;   /* closed elements in 
> socks */
>   /* The last synq overflow event timestamp of this
>* reuse->socks[] group.
>*/
> diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> index bbdd3c7b6cb5..fd133516ac0e 100644
> --- a/net/core/sock_reuseport.c
> +++ b/net/core/sock_reuseport.c
> @@ -98,16 +98,21 @@ static struct sock_reuseport *reuseport_grow(struct 
> sock_reuseport *reuse)
>   return NULL;
>  
>   more_reuse->num_socks = reuse->num_socks;
> + more_reuse->num_closed_socks = reuse->num_closed_socks;
>   more_reuse->prog = reuse->prog;
>   more_reuse->reuseport_id = reuse->reuseport_id;
>   more_reuse->bind_inany = reuse->bind_inany;
>   more_reuse->has_conns = reuse->has_conns;
> + more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);
>  
>   memcpy(more_reuse->socks, reuse->socks,
>  reuse->num_socks * sizeof(struct sock *));
> - more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);
> + memcpy(more_reuse->socks +
> +(more_reuse->max_socks - more_reuse->num_closed_socks),
> +reuse->socks + reuse->num_socks,
> +reuse->num_closed_socks * sizeof(struct sock *));
>  
> - for (i = 0; i < reuse->num_socks; ++i)
> + for (i = 0; i < reuse->max_socks; ++i)
>   rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
>  more_reuse);
>  
> @@ -129,6 +134,25 @@ static void reuseport_free_rcu(struct rcu_head *head)
>   kfree(reuse);
>  }
>  
> +static int reuseport_sock_index(

Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

2020-12-04 Thread Martin KaFai Lau
On Tue, Dec 01, 2020 at 11:44:10PM +0900, Kuniyuki Iwashima wrote:
[ ... ]
> diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> index fd133516ac0e..60d7c1f28809 100644
> --- a/net/core/sock_reuseport.c
> +++ b/net/core/sock_reuseport.c
> @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock 
> *sk2, bool bind_inany)
>  }
>  EXPORT_SYMBOL(reuseport_add_sock);
>  
> -void reuseport_detach_sock(struct sock *sk)
> +struct sock *reuseport_detach_sock(struct sock *sk)
>  {
>   struct sock_reuseport *reuse;
> + struct bpf_prog *prog;
> + struct sock *nsk = NULL;
>   int i;
>  
>   spin_lock_bh(&reuseport_lock);
> @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
>  
>   reuse->num_socks--;
>   reuse->socks[i] = reuse->socks[reuse->num_socks];
> + prog = rcu_dereference(reuse->prog);
Is it under rcu_read_lock() here?

>  
>   if (sk->sk_protocol == IPPROTO_TCP) {
> + if (reuse->num_socks && !prog)
> + nsk = i == reuse->num_socks ? reuse->socks[i - 
> 1] : reuse->socks[i];
> +
>   reuse->num_closed_socks++;
>   reuse->socks[reuse->max_socks - 
> reuse->num_closed_socks] = sk;
>   } else {
> @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
>   call_rcu(&reuse->rcu, reuseport_free_rcu);
>  out:
>   spin_unlock_bh(&reuseport_lock);
> +
> + return nsk;
>  }
>  EXPORT_SYMBOL(reuseport_detach_sock);
>  
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 1451aa9712b0..b27241ea96bd 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>  }
>  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
>  
> +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> +{
> + struct request_sock_queue *old_accept_queue, *new_accept_queue;
> +
> + old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> + new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> +
> + spin_lock(&old_accept_queue->rskq_lock);
> + spin_lock(&new_accept_queue->rskq_lock);
I am also not very thrilled about this double spin_lock.
Can this be done in (or like) inet_csk_listen_stop() instead?

> +
> + if (old_accept_queue->rskq_accept_head) {
> + if (new_accept_queue->rskq_accept_head)
> + old_accept_queue->rskq_accept_tail->dl_next =
> + new_accept_queue->rskq_accept_head;
> + else
> + new_accept_queue->rskq_accept_tail = 
> old_accept_queue->rskq_accept_tail;
> +
> + new_accept_queue->rskq_accept_head = 
> old_accept_queue->rskq_accept_head;
> + old_accept_queue->rskq_accept_head = NULL;
> + old_accept_queue->rskq_accept_tail = NULL;
> +
> + WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + 
> sk->sk_ack_backlog);
> + WRITE_ONCE(sk->sk_ack_backlog, 0);
> + }
> +
> + spin_unlock(&new_accept_queue->rskq_lock);
> + spin_unlock(&old_accept_queue->rskq_lock);
> +}
> +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
> +
>  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
>struct request_sock *req, bool own_req)
>  {
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 45fb450b4522..545538a6bfac 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -681,6 +681,7 @@ void inet_unhash(struct sock *sk)
>  {
>   struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
>   struct inet_listen_hashbucket *ilb = NULL;
> + struct sock *nsk;
>   spinlock_t *lock;
>  
>   if (sk_unhashed(sk))
> @@ -696,8 +697,12 @@ void inet_unhash(struct sock *sk)
>   if (sk_unhashed(sk))
>   goto unlock;
>  
> - if (rcu_access_pointer(sk->sk_reuseport_cb))
> - reuseport_detach_sock(sk);
> + if (rcu_access_pointer(sk->sk_reuseport_cb)) {
> + nsk = reuseport_detach_sock(sk);
> + if (nsk)
> + inet_csk_reqsk_queue_migrate(sk, nsk);
> + }
> +
>   if (ilb) {
>   inet_unhash2(hashinfo, sk);
>   ilb->count--;
> -- 
> 2.17.2 (Apple Git-113)
> 


Re: [PATCH v1 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

2020-12-04 Thread Martin KaFai Lau
On Tue, Dec 01, 2020 at 11:44:18PM +0900, Kuniyuki Iwashima wrote:
> This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> 
> Reviewed-by: Benjamin Herrenschmidt 
> Signed-off-by: Kuniyuki Iwashima 
> ---
>  .../bpf/prog_tests/migrate_reuseport.c| 164 ++
>  .../bpf/progs/test_migrate_reuseport_kern.c   |  54 ++
>  2 files changed, 218 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
>  create mode 100644 
> tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c 
> b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
> new file mode 100644
> index ..87c72d9ccadd
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
> @@ -0,0 +1,164 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Check if we can migrate child sockets.
> + *
> + *   1. call listen() for 5 server sockets.
> + *   2. update a map to migrate all child socket
> + *to the last server socket (migrate_map[cookie] = 4)
> + *   3. call connect() for 25 client sockets.
> + *   4. call close() for first 4 server sockets.
> + *   5. call accept() for the last server socket.
> + *
> + * Author: Kuniyuki Iwashima 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define NUM_SOCKS 5
> +#define LOCALHOST "127.0.0.1"
> +#define err_exit(condition, message)   \
> + do {  \
> + if (condition) {  \
> + perror("ERROR: " message " ");\
> + exit(1);  \
> + } \
> + } while (0)
> +
> +__u64 server_fds[NUM_SOCKS];
> +int prog_fd, reuseport_map_fd, migrate_map_fd;
> +
> +
> +void setup_bpf(void)
> +{
> + struct bpf_object *obj;
> + struct bpf_program *prog;
> + struct bpf_map *reuseport_map, *migrate_map;
> + int err;
> +
> + obj = bpf_object__open("test_migrate_reuseport_kern.o");
> + err_exit(libbpf_get_error(obj), "opening BPF object file failed");
> +
> + err = bpf_object__load(obj);
> + err_exit(err, "loading BPF object failed");
> +
> + prog = bpf_program__next(NULL, obj);
> + err_exit(!prog, "loading BPF program failed");
> +
> + reuseport_map = bpf_object__find_map_by_name(obj, "reuseport_map");
> + err_exit(!reuseport_map, "loading BPF reuseport_map failed");
> +
> + migrate_map = bpf_object__find_map_by_name(obj, "migrate_map");
> + err_exit(!migrate_map, "loading BPF migrate_map failed");
> +
> + prog_fd = bpf_program__fd(prog);
> + reuseport_map_fd = bpf_map__fd(reuseport_map);
> + migrate_map_fd = bpf_map__fd(migrate_map);
> +}
> +
> +void test_listen(void)
> +{
> + struct sockaddr_in addr;
> + socklen_t addr_len = sizeof(addr);
> + int i, err, optval = 1, migrated_to = NUM_SOCKS - 1;
> + __u64 value;
> +
> + addr.sin_family = AF_INET;
> + addr.sin_port = htons(80);
> + inet_pton(AF_INET, LOCALHOST, &addr.sin_addr.s_addr);
> +
> + for (i = 0; i < NUM_SOCKS; i++) {
> + server_fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
> + err_exit(server_fds[i] == -1, "socket() for listener sockets 
> failed");
> +
> + err = setsockopt(server_fds[i], SOL_SOCKET, SO_REUSEPORT,
> +  &optval, sizeof(optval));
> + err_exit(err == -1, "setsockopt() for SO_REUSEPORT failed");
> +
> + if (i == 0) {
> + err = setsockopt(server_fds[i], SOL_SOCKET, 
> SO_ATTACH_REUSEPORT_EBPF,
> +  &prog_fd, sizeof(prog_fd));
> + err_exit(err == -1, "setsockopt() for 
> SO_ATTACH_REUSEPORT_EBPF failed");
> + }
> +
> + err = bind(server_fds[i], (struct sockaddr *)&addr, addr_len);
> + err_exit(err == -1, "bind() failed");
> +
> + err = listen(server_fds[i], 32);
> + err_exit(err == -1, "listen() failed");
> +
> + err = bpf_map_update_elem(reuseport_map_fd, &i, &server_fds[i], 
> BPF_NOEXIST);
> + err_exit(err == -1, "updating BPF reuseport_map failed");
> +
> + err = bpf_map_lookup_elem(reuseport_map_fd, &i, &value);
> + err_exit(err == -1, "looking up BPF reuseport_map failed");
> +
> + printf("fd[%d] (cookie: %llu) -> fd[%d]\n", i, value, 
> migrated_to);
> + err = bpf_map_update_elem(migrate_map_fd, &value, &migrated_to, 
> BPF_NOEXIST);
> + err_exit(err == -1, "updating BPF migrate_map failed");
> + }
> +}
> +
> +void test_connect(void)
> +{
> + struct sockaddr_in addr;
> + socklen_t addr_len = sizeof(addr)

Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

2020-12-07 Thread Martin KaFai Lau
On Sun, Dec 06, 2020 at 01:03:07AM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau 
> Date:   Fri, 4 Dec 2020 17:42:41 -0800
> > On Tue, Dec 01, 2020 at 11:44:10PM +0900, Kuniyuki Iwashima wrote:
> > [ ... ]
> > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > index fd133516ac0e..60d7c1f28809 100644
> > > --- a/net/core/sock_reuseport.c
> > > +++ b/net/core/sock_reuseport.c
> > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock 
> > > *sk2, bool bind_inany)
> > >  }
> > >  EXPORT_SYMBOL(reuseport_add_sock);
> > >  
> > > -void reuseport_detach_sock(struct sock *sk)
> > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > >  {
> > >   struct sock_reuseport *reuse;
> > > + struct bpf_prog *prog;
> > > + struct sock *nsk = NULL;
> > >   int i;
> > >  
> > >   spin_lock_bh(&reuseport_lock);
> > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > >  
> > >   reuse->num_socks--;
> > >   reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > + prog = rcu_dereference(reuse->prog);
> > Is it under rcu_read_lock() here?
> 
> reuseport_lock is locked in this function, and we do not modify the prog,
> but is rcu_dereference_protected() preferable?
> 
> ---8<---
> prog = rcu_dereference_protected(reuse->prog,
>lockdep_is_held(&reuseport_lock));
> ---8<---
It is not only reuse->prog.  Other things also require rcu_read_lock(),
e.g. please take a look at __htab_map_lookup_elem().

The TCP_LISTEN sk (selected by bpf to be the target of the migration)
is also protected by rcu.
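
In other words, the pick would need an explicit rcu read-side section,
roughly like the sketch below.  Only the nsk-pick part of the quoted
hunk is shown; this is a sketch of the expected shape, not a full fix:

---8<---
	rcu_read_lock();
	prog = rcu_dereference(reuse->prog);
	if (sk->sk_protocol == IPPROTO_TCP && reuse->num_socks && !prog)
		nsk = i == reuse->num_socks ? reuse->socks[i - 1] :
					      reuse->socks[i];
	rcu_read_unlock();
---8<---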

I am surprised there is no WARNING in the test.
Do you have the needed DEBUG_LOCK* config enabled?

> > >   if (sk->sk_protocol == IPPROTO_TCP) {
> > > + if (reuse->num_socks && !prog)
> > > + nsk = i == reuse->num_socks ? reuse->socks[i - 
> > > 1] : reuse->socks[i];
> > > +
> > >   reuse->num_closed_socks++;
> > >   reuse->socks[reuse->max_socks - 
> > > reuse->num_closed_socks] = sk;
> > >   } else {
> > > @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
> > >   call_rcu(&reuse->rcu, reuseport_free_rcu);
> > >  out:
> > >   spin_unlock_bh(&reuseport_lock);
> > > +
> > > + return nsk;
> > >  }
> > >  EXPORT_SYMBOL(reuseport_detach_sock);
> > >  
> > > diff --git a/net/ipv4/inet_connection_sock.c 
> > > b/net/ipv4/inet_connection_sock.c
> > > index 1451aa9712b0..b27241ea96bd 100644
> > > --- a/net/ipv4/inet_connection_sock.c
> > > +++ b/net/ipv4/inet_connection_sock.c
> > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock 
> > > *sk,
> > >  }
> > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > >  
> > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > +{
> > > + struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > +
> > > + old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > + new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > +
> > > + spin_lock(&old_accept_queue->rskq_lock);
> > > + spin_lock(&new_accept_queue->rskq_lock);
> > I am also not very thrilled on this double spin_lock.
> > Can this be done in (or like) inet_csk_listen_stop() instead?
> 
> It will be possible to migrate sockets in inet_csk_listen_stop(), but I
> think it is better to do it just after reuseport_detach_sock() becuase we
> can select a different listener (almost) every time at a lower cost by
> selecting the moved socket and pass it to inet_csk_reqsk_queue_migrate()
> easily.
I don't see the "lower cost" point.  Please elaborate.

> 
> sk_hash of the listener is 0, so we would have to generate a random number
> in inet_csk_listen_stop().
If I read it correctly, it is also passing 0 as the sk_hash to
bpf_run_sk_reuseport() from reuseport_detach_sock().

Also, how is the sk_hash expected to be used?  I don't see
it in the test.


Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

2020-12-07 Thread Martin KaFai Lau
On Thu, Dec 03, 2020 at 11:14:24PM +0900, Kuniyuki Iwashima wrote:
> From:   Eric Dumazet 
> Date:   Tue, 1 Dec 2020 16:25:51 +0100
> > On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > which is used only by inet_unhash(). If it is not NULL,
> > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > sockets from the closing listener to the selected one.
> > > 
> > > Listening sockets hold incoming connections as a linked list of struct
> > > request_sock in the accept queue, and each request has reference to a full
> > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink
> > > the requests from the closing listener's queue and relink them to the head
> > > of the new listener's queue. We do not process each request and its
> > > reference to the listener, so the migration completes in O(1) time
> > > complexity. However, in the case of TCP_SYN_RECV sockets, we take special
> > > care in the next commit.
> > > 
> > > By default, the kernel selects a new listener randomly. In order to pick
> > > out a different socket every time, we select the last element of socks[] 
> > > as
> > > the new listener. This behaviour is based on how the kernel moves sockets
> > > in socks[]. (See also [1])
> > > 
> > > Basically, in order to redistribute sockets evenly, we have to use an eBPF
> > > program called in the later commit, but as the side effect of such default
> > > selection, the kernel can redistribute old requests evenly to new 
> > > listeners
> > > for a specific case where the application replaces listeners by
> > > generations.
> > > 
> > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > first two by turns. The sockets move in socks[] like below.
> > > 
> > >   socks[0] : A <-.  socks[0] : D  socks[0] : D
> > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > >   socks[2] : C   |  socks[2] : C --'
> > >   socks[3] : D --'
> > > 
> > > Then, if C and D have newer settings than A and B, and each socket has a
> > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > requests evenly to new listeners.
> > > 
> > >   socks[0] : A (a) <-.  socks[0] : D (a + d)  socks[0] : D (a + d)
> > >   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
> > >   socks[2] : C (c)   |  socks[2] : C (c) --'
> > >   socks[3] : D (d) --'
> > > 
> > > Here, (A, D) or (B, C) can have different application settings, but they
> > > MUST have the same settings at the socket API level; otherwise, unexpected
> > > error may happen. For instance, if only the new listeners have
> > > TCP_SAVE_SYN, old requests do not have SYN data, so the application will
> > > face inconsistency and cause an error.
> > > 
> > > Therefore, if there are different kinds of sockets, we must attach an eBPF
> > > program described in later commits.
> > > 
> > > Link: 
> > > https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dh...@mail.gmail.com/
> > > Reviewed-by: Benjamin Herrenschmidt 
> > > Signed-off-by: Kuniyuki Iwashima 
> > > ---
> > >  include/net/inet_connection_sock.h |  1 +
> > >  include/net/sock_reuseport.h   |  2 +-
> > >  net/core/sock_reuseport.c  | 10 +-
> > >  net/ipv4/inet_connection_sock.c| 30 ++
> > >  net/ipv4/inet_hashtables.c |  9 +++--
> > >  5 files changed, 48 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/include/net/inet_connection_sock.h 
> > > b/include/net/inet_connection_sock.h
> > > index 7338b3865a2a..2ea2d743f8fc 100644
> > > --- a/include/net/inet_connection_sock.h
> > > +++ b/include/net/inet_connection_sock.h
> > > @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const 
> > > struct sock *sk,
> > >  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > struct request_sock *req,
> > > struct sock *child);
> > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
> > >  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock 
> > > *req,
> > >  unsigned long timeout);
> > >  struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock 
> > > *child,
> > > diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> > > index 0e558ca7afbf..09a1b1539d4c 100644
> > > --- a/include/net/sock_reuseport.h
> > > +++ b/include/net/sock_reuseport.h
> > > @@ -31,7 +31,7 @@ struct sock_reuseport {
> > >  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
> > >  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
> > > bool bind_inany);
> > > -extern void reuseport_detach_sock(struct sock *sk);
> > > +extern struct sock *reuseport_detach_sock(struct sock *sk);
> > >  extern struct sock *r

Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

2020-12-07 Thread Martin KaFai Lau
On Tue, Dec 01, 2020 at 11:44:10PM +0900, Kuniyuki Iwashima wrote:

> @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
>  
>   reuse->num_socks--;
>   reuse->socks[i] = reuse->socks[reuse->num_socks];
> + prog = rcu_dereference(reuse->prog);
>  
>   if (sk->sk_protocol == IPPROTO_TCP) {
> + if (reuse->num_socks && !prog)
> + nsk = i == reuse->num_socks ? reuse->socks[i - 
> 1] : reuse->socks[i];
I asked in the earlier thread if the primary use case is to only
use the bpf prog to pick.  That thread did not come to
a solid answer but did conclude that the sysctl should not
control the behavior of the BPF_SK_REUSEPORT_SELECT_OR_MIGRATE prog.

From this change here, it seems it is still desired to depend only
on the kernel's random pick even when no bpf prog is attached.
If that is the case, a sysctl guarding this, so the current
behavior is not changed by default, makes sense.
It should still only control the non-bpf-pick behavior:
when the sysctl is on, the kernel will still do a random pick
when there is no bpf prog attached to the reuseport group.
Thoughts?
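
Roughly like the sketch below.  The sysctl name is made up here, and the
bpf-pick path added in the later patch is not shown:

---8<---
	if (sk->sk_protocol == IPPROTO_TCP) {
		/* hypothetical sysctl; it only gates the kernel pick,
		 * never a bpf pick
		 */
		if (reuse->num_socks && !prog &&
		    READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_migrate_req))
			nsk = i == reuse->num_socks ? reuse->socks[i - 1] :
						      reuse->socks[i];

		reuse->num_closed_socks++;
		reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
	}
---8<---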

> +
>   reuse->num_closed_socks++;
>   reuse->socks[reuse->max_socks - 
> reuse->num_closed_socks] = sk;
>   } else {
> @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk)
>   call_rcu(&reuse->rcu, reuseport_free_rcu);
>  out:
>   spin_unlock_bh(&reuseport_lock);
> +
> + return nsk;
>  }
>  EXPORT_SYMBOL(reuseport_detach_sock);



Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

2020-12-07 Thread Martin KaFai Lau
On Tue, Dec 08, 2020 at 03:31:34PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau 
> Date:   Mon, 7 Dec 2020 12:33:15 -0800
> > On Thu, Dec 03, 2020 at 11:14:24PM +0900, Kuniyuki Iwashima wrote:
> > > From:   Eric Dumazet 
> > > Date:   Tue, 1 Dec 2020 16:25:51 +0100
> > > > On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > > > > This patch lets reuseport_detach_sock() return a pointer of struct 
> > > > > sock,
> > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > sockets from the closing listener to the selected one.
> > > > > 
> > > > > Listening sockets hold incoming connections as a linked list of struct
> > > > > request_sock in the accept queue, and each request has reference to a 
> > > > > full
> > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we only 
> > > > > unlink
> > > > > the requests from the closing listener's queue and relink them to the 
> > > > > head
> > > > > of the new listener's queue. We do not process each request and its
> > > > > reference to the listener, so the migration completes in O(1) time
> > > > > complexity. However, in the case of TCP_SYN_RECV sockets, we take 
> > > > > special
> > > > > care in the next commit.
> > > > > 
> > > > > By default, the kernel selects a new listener randomly. In order to 
> > > > > pick
> > > > > out a different socket every time, we select the last element of 
> > > > > socks[] as
> > > > > the new listener. This behaviour is based on how the kernel moves 
> > > > > sockets
> > > > > in socks[]. (See also [1])
> > > > > 
> > > > > Basically, in order to redistribute sockets evenly, we have to use an 
> > > > > eBPF
> > > > > program called in the later commit, but as the side effect of such 
> > > > > default
> > > > > selection, the kernel can redistribute old requests evenly to new 
> > > > > listeners
> > > > > for a specific case where the application replaces listeners by
> > > > > generations.
> > > > > 
> > > > > For example, we call listen() for four sockets (A, B, C, D), and 
> > > > > close the
> > > > > first two by turns. The sockets move in socks[] like below.
> > > > > 
> > > > >   socks[0] : A <-.  socks[0] : D  socks[0] : D
> > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > >   socks[2] : C   |  socks[2] : C --'
> > > > >   socks[3] : D --'
> > > > > 
> > > > > Then, if C and D have newer settings than A and B, and each socket 
> > > > > has a
> > > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > > requests evenly to new listeners.
> > > > > 
> > > > >   socks[0] : A (a) <-.  socks[0] : D (a + d)  socks[0] : D (a 
> > > > > + d)
> > > > >   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b 
> > > > > + c)
> > > > >   socks[2] : C (c)   |  socks[2] : C (c) --'
> > > > >   socks[3] : D (d) --'
> > > > > 
> > > > > Here, (A, D) or (B, C) can have different application settings, but 
> > > > > they
> > > > > MUST have the same settings at the socket API level; otherwise, 
> > > > > unexpected
> > > > > error may happen. For instance, if only the new listeners have
> > > > > TCP_SAVE_SYN, old requests do not have SYN data, so the application 
> > > > > will
> > > > > face inconsistency and cause an error.
> > > > > 
> > > > > Therefore, if there are different kinds of sockets, we must attach an 
> > > > > eBPF
> > > > > program described in later commits.
> > > > > 
> > > > > Link: 
> > > > > https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dh...@mail.gmail.com/
> > > > > Reviewed-by: Benjamin Herrenschmidt 
> > > > > Signed-off-by: Kuniyuki Iwashima 
> > > >

Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

2020-12-08 Thread Martin KaFai Lau
On Tue, Dec 08, 2020 at 03:27:14PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau 
> Date:   Mon, 7 Dec 2020 12:14:38 -0800
> > On Sun, Dec 06, 2020 at 01:03:07AM +0900, Kuniyuki Iwashima wrote:
> > > From:   Martin KaFai Lau 
> > > Date:   Fri, 4 Dec 2020 17:42:41 -0800
> > > > On Tue, Dec 01, 2020 at 11:44:10PM +0900, Kuniyuki Iwashima wrote:
> > > > [ ... ]
> > > > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > > > index fd133516ac0e..60d7c1f28809 100644
> > > > > --- a/net/core/sock_reuseport.c
> > > > > +++ b/net/core/sock_reuseport.c
> > > > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct 
> > > > > sock *sk2, bool bind_inany)
> > > > >  }
> > > > >  EXPORT_SYMBOL(reuseport_add_sock);
> > > > >  
> > > > > -void reuseport_detach_sock(struct sock *sk)
> > > > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > > > >  {
> > > > >   struct sock_reuseport *reuse;
> > > > > + struct bpf_prog *prog;
> > > > > + struct sock *nsk = NULL;
> > > > >   int i;
> > > > >  
> > > > >   spin_lock_bh(&reuseport_lock);
> > > > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > > > >  
> > > > >   reuse->num_socks--;
> > > > >   reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > > > + prog = rcu_dereference(reuse->prog);
> > > > Is it under rcu_read_lock() here?
> > > 
> > > reuseport_lock is locked in this function, and we do not modify the prog,
> > > but is rcu_dereference_protected() preferable?
> > > 
> > > ---8<---
> > > prog = rcu_dereference_protected(reuse->prog,
> > >lockdep_is_held(&reuseport_lock));
> > > ---8<---
> > It is not only reuse->prog.  Other things also require rcu_read_lock(),
> > e.g. please take a look at __htab_map_lookup_elem().
> > 
> > The TCP_LISTEN sk (selected by bpf to be the target of the migration)
> > is also protected by rcu.
> 
> Thank you, I will use rcu_read_lock() and rcu_dereference() in v3 patchset.
> 
> 
> > I am surprised there is no WARNING in the test.
> > Do you have the needed DEBUG_LOCK* config enabled?
> 
> Yes, DEBUG_LOCK* was 'y', but rcu_dereference() without rcu_read_lock()
> does not show warnings...
I would at least expect the "WARN_ON_ONCE(!rcu_read_lock_held() ...)"
from __htab_map_lookup_elem() to fire in your test
example in the last patch.

It is better to check the config before sending v3.

[ ... ]

> > > > > diff --git a/net/ipv4/inet_connection_sock.c 
> > > > > b/net/ipv4/inet_connection_sock.c
> > > > > index 1451aa9712b0..b27241ea96bd 100644
> > > > > --- a/net/ipv4/inet_connection_sock.c
> > > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct 
> > > > > sock *sk,
> > > > >  }
> > > > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > > > >  
> > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > > > +{
> > > > > + struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > > > +
> > > > > + old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > > > + new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > > > +
> > > > > + spin_lock(&old_accept_queue->rskq_lock);
> > > > > + spin_lock(&new_accept_queue->rskq_lock);
> > > > I am also not very thrilled on this double spin_lock.
> > > > Can this be done in (or like) inet_csk_listen_stop() instead?
> > > 
> > > It will be possible to migrate sockets in inet_csk_listen_stop(), but I
> > > think it is better to do it just after reuseport_detach_sock() becuase we
> > > can select a different listener (almost) every time at a lower cost by
> > > selecting the moved socket and pass it to inet_csk_reqsk_queue_migrate()
> > > easily.
> > I don't see the "lower cost" point.  Please elaborate.
> 
> In reuseport_select_sock(), we pass sk_hash of the request socket to
> reciprocal_scale() and gener

Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

2020-12-08 Thread Martin KaFai Lau
On Tue, Dec 08, 2020 at 05:17:48PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau 
> Date:   Mon, 7 Dec 2020 23:34:41 -0800
> > On Tue, Dec 08, 2020 at 03:31:34PM +0900, Kuniyuki Iwashima wrote:
> > > From:   Martin KaFai Lau 
> > > Date:   Mon, 7 Dec 2020 12:33:15 -0800
> > > > On Thu, Dec 03, 2020 at 11:14:24PM +0900, Kuniyuki Iwashima wrote:
> > > > > From:   Eric Dumazet 
> > > > > Date:   Tue, 1 Dec 2020 16:25:51 +0100
> > > > > > On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > > > > > > This patch lets reuseport_detach_sock() return a pointer of 
> > > > > > > struct sock,
> > > > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > > > inet_csk_reqsk_queue_migrate() migrates 
> > > > > > > TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > > > sockets from the closing listener to the selected one.
> > > > > > > 
> > > > > > > Listening sockets hold incoming connections as a linked list of 
> > > > > > > struct
> > > > > > > request_sock in the accept queue, and each request has reference 
> > > > > > > to a full
> > > > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we 
> > > > > > > only unlink
> > > > > > > the requests from the closing listener's queue and relink them to 
> > > > > > > the head
> > > > > > > of the new listener's queue. We do not process each request and 
> > > > > > > its
> > > > > > > reference to the listener, so the migration completes in O(1) time
> > > > > > > complexity. However, in the case of TCP_SYN_RECV sockets, we take 
> > > > > > > special
> > > > > > > care in the next commit.
> > > > > > > 
> > > > > > > By default, the kernel selects a new listener randomly. In order 
> > > > > > > to pick
> > > > > > > out a different socket every time, we select the last element of 
> > > > > > > socks[] as
> > > > > > > the new listener. This behaviour is based on how the kernel moves 
> > > > > > > sockets
> > > > > > > in socks[]. (See also [1])
> > > > > > > 
> > > > > > > Basically, in order to redistribute sockets evenly, we have to 
> > > > > > > use an eBPF
> > > > > > > program called in the later commit, but as the side effect of 
> > > > > > > such default
> > > > > > > selection, the kernel can redistribute old requests evenly to new 
> > > > > > > listeners
> > > > > > > for a specific case where the application replaces listeners by
> > > > > > > generations.
> > > > > > > 
> > > > > > > For example, we call listen() for four sockets (A, B, C, D), and 
> > > > > > > close the
> > > > > > > first two by turns. The sockets move in socks[] like below.
> > > > > > > 
> > > > > > >   socks[0] : A <-.  socks[0] : D  socks[0] : D
> > > > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > > > >   socks[2] : C   |  socks[2] : C --'
> > > > > > >   socks[3] : D --'
> > > > > > > 
> > > > > > > Then, if C and D have newer settings than A and B, and each 
> > > > > > > socket has a
> > > > > > > request (a, b, c, d) in their accept queue, we can redistribute 
> > > > > > > old
> > > > > > > requests evenly to new listeners.
> > > > > > > 
> > > > > > >   socks[0] : A (a) <-.  socks[0] : D (a + d)  socks[0] : 
> > > > > > > D (a + d)
> > > > > > >   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : 
> > > > > > > C (b + c)
> > > > > > >   socks[2] : C (c)   |  socks[2] : C (c) --'
> > > > > > >   socks[3] : D (d) --'
> > > > > > > 
> > > > > > > Here, (A, D) or (B, C) can have different application settings, 
> > > > > >

Re: [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

2020-11-18 Thread Martin KaFai Lau
On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> This patch lets reuseport_detach_sock() return a pointer of struct sock,
> which is used only by inet_unhash(). If it is not NULL,
> inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> sockets from the closing listener to the selected one.
> 
> Listening sockets hold incoming connections as a linked list of struct
> request_sock in the accept queue, and each request has reference to a full
> socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> requests from the closing listener's queue and relink them to the head of
> the new listener's queue. We do not process each request, so the migration
> completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> sockets, we will take special care in the next commit.
> 
> By default, we select the last element of socks[] as the new listener.
> This behaviour is based on how the kernel moves sockets in socks[].
> 
> For example, we call listen() for four sockets (A, B, C, D), and close the
> first two by turns. The sockets move in socks[] like below. (See also [1])
> 
>   socks[0] : A <-.  socks[0] : D  socks[0] : D
>   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
>   socks[2] : C   |  socks[2] : C --'
>   socks[3] : D --'
> 
> Then, if C and D have newer settings than A and B, and each socket has a
> request (a, b, c, d) in their accept queue, we can redistribute old
> requests evenly to new listeners.
I don't think it should emphasize/claim there is a specific way that
the kernel-pick here can redistribute the requests evenly.  It depends on
how the application closes/listens.  Userspace cannot expect the
ordering of socks[] to behave in a certain way.

The primary redistribution policy has to depend on BPF, which is the
policy defined by the user based on its application logic (e.g. how
its binary restart works).  The application (and bpf) knows which one
is a dying process and can avoid distributing to it.

The kernel-pick could be an optional fallback but not a must.  If the bpf
prog is attached, I would even go further and call bpf to redistribute
regardless of the sysctl, so I think the sysctl is not necessary.

> 
>   socks[0] : A (a) <-.  socks[0] : D (a + d)  socks[0] : D (a + d)
>   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
>   socks[2] : C (c)   |  socks[2] : C (c) --'
>   socks[3] : D (d) --'
> 


Re: [RFC PATCH bpf-next 6/8] bpf: Add cookie in sk_reuseport_md.

2020-11-18 Thread Martin KaFai Lau
On Tue, Nov 17, 2020 at 06:40:21PM +0900, Kuniyuki Iwashima wrote:
> We will call sock_reuseport.prog for socket migration in the next commit,
> so the eBPF program has to know which listener is closing in order to
> select the new listener.
> 
> Currently, we can get a unique ID for each listener in the userspace by
> calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.
> This patch exposes the ID to the eBPF program.
> 
> Reviewed-by: Benjamin Herrenschmidt 
> Signed-off-by: Kuniyuki Iwashima 
> ---
>  include/linux/bpf.h| 1 +
>  include/uapi/linux/bpf.h   | 1 +
>  net/core/filter.c  | 8 
>  tools/include/uapi/linux/bpf.h | 1 +
>  4 files changed, 11 insertions(+)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 581b2a2e78eb..c0646eceffa2 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1897,6 +1897,7 @@ struct sk_reuseport_kern {
>   u32 hash;
>   u32 reuseport_id;
>   bool bind_inany;
> + u64 cookie;
>  };
>  bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type 
> type,
> struct bpf_insn_access_aux *info);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 162999b12790..3fcddb032838 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4403,6 +4403,7 @@ struct sk_reuseport_md {
>   __u32 ip_protocol;  /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
>   __u32 bind_inany;   /* Is sock bound to an INANY address? */
>   __u32 hash; /* A hash of the packet 4 tuples */
> + __u64 cookie;   /* ID of the listener in map */
Instead of only adding the cookie of a sk, let's make the sk pointer
available:

__bpf_md_ptr(struct bpf_sock *, sk);

and then use BPF_FUNC_get_socket_cookie to get the cookie.

Other fields of the sk can then be directly accessed once
the sk pointer is available.
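
For example, a migration prog could then do something like the sketch
below.  It assumes the sk field suggested above is added and that
bpf_get_socket_cookie() is extended to accept it (a separate change
discussed elsewhere in this series); the map layout is only for
illustration:

---8<---
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
	__uint(max_entries, 256);
	__type(key, __u32);
	__type(value, __u64);
} reuseport_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 256);
	__type(key, __u64);	/* cookie of the closing listener */
	__type(value, __u32);	/* index of the target listener   */
} migrate_map SEC(".maps");

SEC("sk_reuseport")
int select_by_cookie(struct sk_reuseport_md *md)
{
	__u64 cookie;
	__u32 *index;

	if (!md->sk)
		return SK_PASS;

	cookie = bpf_get_socket_cookie(md->sk);
	index = bpf_map_lookup_elem(&migrate_map, &cookie);
	if (!index)
		return SK_PASS;

	/* On success the kernel uses the selected socket; on failure it
	 * falls back to its own pick since no socket was selected.
	 */
	bpf_sk_select_reuseport(md, &reuseport_map, index, 0);

	return SK_PASS;
}

char _license[] SEC("license") = "GPL";
---8<---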


Re: [RFC PATCH bpf-next 7/8] bpf: Call bpf_run_sk_reuseport() for socket migration.

2020-11-18 Thread Martin KaFai Lau
On Tue, Nov 17, 2020 at 06:40:22PM +0900, Kuniyuki Iwashima wrote:
> This patch makes it possible to select a new listener for socket migration
> by eBPF.
> 
> The noteworthy point is that we select a listening socket in
> reuseport_detach_sock() and reuseport_select_sock(), but we do not have
> struct skb in the unhash path.
> 
> Since we cannot pass skb to the eBPF program, we run only the
> BPF_PROG_TYPE_SK_REUSEPORT program by calling bpf_run_sk_reuseport() with
> skb NULL. So, some fields derived from skb are also NULL in the eBPF
> program.
More things need to be considered here when skb is NULL.

Some helpers are probably assuming skb is not NULL.

Also, the sk_lookup in filter.c is actually passing a NULL skb to avoid
doing the reuseport select.

> 
> Moreover, we can cancel migration by returning SK_DROP. This feature is
> useful when listeners have different settings at the socket API level or
> when we want to free resources as soon as possible.
> 
> Reviewed-by: Benjamin Herrenschmidt 
> Signed-off-by: Kuniyuki Iwashima 
> ---
>  net/core/filter.c  | 26 +-
>  net/core/sock_reuseport.c  | 23 ---
>  net/ipv4/inet_hashtables.c |  2 +-
>  3 files changed, 42 insertions(+), 9 deletions(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 01e28f283962..ffc4591878b8 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -8914,6 +8914,22 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type 
> type,
>   SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(S, NS, F, NF, \
>BPF_FIELD_SIZEOF(NS, NF), 0)
>  
> +#define SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF_OR_NULL(S, NS, F, NF, SIZE, 
> OFF)\
> + do {
> \
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(S, F), si->dst_reg,  
> \
> +   si->src_reg, offsetof(S, F)); 
> \
> + *insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);  
> \
Although it may not matter much, always doing this check seems not very
ideal considering the fast path will always have an skb and only the slow
path (accept-queue migrate) has a NULL skb.  I think the req_sk usually
has the skb too, except in the timer case.

A first thought is to create a temp skb, but it has its own issues.
Or it may actually belong to a new prog type.  However, let's keep
exploring possible options (including a NULL skb).

> + *insn++ = BPF_LDX_MEM(  
> \
> + SIZE, si->dst_reg, si->dst_reg, 
> \
> + bpf_target_off(NS, NF, sizeof_field(NS, NF),
> \
> +target_size) 
> \
> + + OFF); 
> \
> + } while (0)


Re: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.

2020-11-18 Thread Martin KaFai Lau
On Tue, Nov 17, 2020 at 06:40:15PM +0900, Kuniyuki Iwashima wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation. When a SYN packet is received, the connection is tied to a
> listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners could accept such connections.
> 
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in
> in-flight ACK of 3WHS is responded by RST.
> 
> As a workaround for this issue, we can do connection draining by eBPF:
> 
>   1. Before closing a listener, stop routing SYN packets to it.
>   2. Wait enough time for requests to complete 3WHS.
>   3. Accept connections until EAGAIN, then close the listener.
> 
> Although this approach seems to work well, EAGAIN has nothing to do with
> how many requests are still during 3WHS. Thus, we have to know the number
It sounds like the application can already drain the established socket
by accept()?  To solve the problem that you have,
does it mean migrating req_sk (the in-progress 3WHS) is enough?

Applications can already use the bpf prog to do (1) and divert
the SYN to the newly started process.

If the application cares about service disruption,
it usually needs to drain the fd(s) that it already has and
finish serving the pending requests (e.g. https) on them anyway.
The time taken to finish those could already be longer than it takes
to drain the accept queue or finish off the 3WHS in reasonable time.
Or is it that the application you have does not need to drain the fd(s)
it already has and can close them immediately?
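
For reference, the accept()-based drain being discussed is roughly the
following userspace sketch (not from this series; listen_fd is assumed to be
O_NONBLOCK already, and hand_off() is a hypothetical callback):

#define _GNU_SOURCE
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

static void drain_and_close(int listen_fd, void (*hand_off)(int fd))
{
	for (;;) {
		int fd = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);

		if (fd < 0) {
			if (errno == EINTR)
				continue;
			break;	/* EAGAIN/EWOULDBLOCK: accept queue drained */
		}
		hand_off(fd);
	}
	close(listen_fd);
}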

> of such requests by counting SYN packets by eBPF to complete connection
> draining.
> 
>   1. Start counting SYN packets and accept syscalls using eBPF map.
>   2. Stop routing SYN packets.
>   3. Accept connections up to the count, then close the listener.


Re: [PATCH v2 2/5] bpf: Add a bpf_sock_from_file helper

2020-11-19 Thread Martin KaFai Lau
On Thu, Nov 19, 2020 at 05:26:51PM +0100, Florent Revest wrote:
> From: Florent Revest 
> 
> While eBPF programs can check whether a file is a socket by file->f_op
> == &socket_file_ops, they cannot convert the void private_data pointer
> to a struct socket BTF pointer. In order to do this a new helper
> wrapping sock_from_file is added.
> 
> This is useful to tracing programs but also other program types
> inheriting this set of helpers such as iterators or LSM programs.
Acked-by: Martin KaFai Lau 


Re: [PATCH v2 3/5] bpf: Expose bpf_sk_storage_* to iterator programs

2020-11-19 Thread Martin KaFai Lau
On Thu, Nov 19, 2020 at 05:26:52PM +0100, Florent Revest wrote:
> From: Florent Revest 
> 
> Iterators are currently used to expose kernel information to userspace
> over fast procfs-like files but iterators could also be used to
> manipulate local storage. For example, the task_file iterator could be
> used to initialize a socket local storage with associations between
> processes and sockets or to selectively delete local storage values.
> 
> This exposes both socket local storage helpers to all iterators.
> Alternatively we could expose it to only certain iterators with strcmps
> on prog->aux->attach_func_name.
I cannot see any hole in iterating the bpf_sk_storage_map, and also
none in bpf_sk_storage_get/delete() itself, for now.

I have looked at other iters (e.g. tcp, udp, and sock_map iter).
It would be good if you could double-check them also.

I think at least one more test on the tcp iter is needed.

Other than that,

Acked-by: Martin KaFai Lau 


Re: [PATCH v2 4/5] bpf: Add an iterator selftest for bpf_sk_storage_delete

2020-11-19 Thread Martin KaFai Lau
On Thu, Nov 19, 2020 at 05:26:53PM +0100, Florent Revest wrote:
> From: Florent Revest 
> 
> The eBPF program iterates over all entries (well, only one) of a socket
> local storage map and deletes them all. The test makes sure that the
> entry is indeed deleted.
Note that if there are many entries and seq->op->stop() is called (due to
seq_has_overflowed()), it is possible that not all of the entries will be
iterated (and deleted).  However, I think this is a more generic issue with
resuming the iteration and is not specific to this series.

Acked-by: Martin KaFai Lau 


Re: [PATCH v2 5/5] bpf: Add an iterator selftest for bpf_sk_storage_get

2020-11-19 Thread Martin KaFai Lau
On Thu, Nov 19, 2020 at 05:26:54PM +0100, Florent Revest wrote:
> From: Florent Revest 
> 
> The eBPF program iterates over all files and tasks. For all socket
> files, it stores the tgid of the last task it encountered with a handle
> to that socket. This is a heuristic for finding the "owner" of a socket
> similar to what's done by lsof, ss, netstat or fuser. Potentially, this
> information could be used from a cgroup_skb/*gress hook to try to
> associate network traffic with processes.
> 
> The test makes sure that a socket it created is tagged with prog_tests's
> pid.
> 
> Signed-off-by: Florent Revest 
> ---
>  .../selftests/bpf/prog_tests/bpf_iter.c   | 35 +++
>  .../progs/bpf_iter_bpf_sk_storage_helpers.c   | 26 ++
>  2 files changed, 61 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c 
> b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> index bb4a638f2e6f..4d0626003c03 100644
> --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> @@ -975,6 +975,39 @@ static void test_bpf_sk_storage_delete(void)
>   bpf_iter_bpf_sk_storage_helpers__destroy(skel);
>  }
>  
> +/* The BPF program stores in every socket the tgid of a task owning a handle 
> to
> + * it. The test verifies that a locally-created socket is tagged with its pid
> + */
> +static void test_bpf_sk_storage_get(void)
> +{
> + struct bpf_iter_bpf_sk_storage_helpers *skel;
> + int err, map_fd, val = -1;
> + int sock_fd = -1;
> +
> + skel = bpf_iter_bpf_sk_storage_helpers__open_and_load();
> + if (CHECK(!skel, "bpf_iter_bpf_sk_storage_helpers__open_and_load",
> +   "skeleton open_and_load failed\n"))
> + return;
> +
> + sock_fd = socket(AF_INET6, SOCK_STREAM, 0);
> + if (CHECK(sock_fd < 0, "socket", "errno: %d\n", errno))
> + goto out;
> +
> + do_dummy_read(skel->progs.fill_socket_owners);
> +
> + map_fd = bpf_map__fd(skel->maps.sk_stg_map);
> +
> + err = bpf_map_lookup_elem(map_fd, &sock_fd, &val);
> + CHECK(err || val != getpid(), "bpf_map_lookup_elem",
> +   "map value wasn't set correctly (expected %d, got %d, err=%d)\n",
> +   getpid(), val, err);
> +
> + if (sock_fd >= 0)
> + close(sock_fd);
> +out:
> + bpf_iter_bpf_sk_storage_helpers__destroy(skel);
> +}
> +
>  static void test_bpf_sk_storage_map(void)
>  {
>   DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> @@ -1131,6 +1164,8 @@ void test_bpf_iter(void)
>   test_bpf_sk_storage_map();
>   if (test__start_subtest("bpf_sk_storage_delete"))
>   test_bpf_sk_storage_delete();
> + if (test__start_subtest("bpf_sk_storage_get"))
> + test_bpf_sk_storage_get();
>   if (test__start_subtest("rdonly-buf-out-of-bound"))
>   test_rdonly_buf_out_of_bound();
>   if (test__start_subtest("buf-neg-offset"))
> diff --git 
> a/tools/testing/selftests/bpf/progs/bpf_iter_bpf_sk_storage_helpers.c 
> b/tools/testing/selftests/bpf/progs/bpf_iter_bpf_sk_storage_helpers.c
> index 01ff3235e413..7206fd6f09ab 100644
> --- a/tools/testing/selftests/bpf/progs/bpf_iter_bpf_sk_storage_helpers.c
> +++ b/tools/testing/selftests/bpf/progs/bpf_iter_bpf_sk_storage_helpers.c
> @@ -21,3 +21,29 @@ int delete_bpf_sk_storage_map(struct 
> bpf_iter__bpf_sk_storage_map *ctx)
>  
>   return 0;
>  }
> +
> +SEC("iter/task_file")
> +int fill_socket_owners(struct bpf_iter__task_file *ctx)
> +{
> + struct task_struct *task = ctx->task;
> + struct file *file = ctx->file;
> + struct socket *sock;
> + int *sock_tgid;
> +
> + if (!task || !file || task->tgid != task->pid)
> + return 0;
> +
> + sock = bpf_sock_from_file(file);
> + if (!sock)
> + return 0;
> +
> + sock_tgid = bpf_sk_storage_get(&sk_stg_map, sock->sk, 0,
> +BPF_SK_STORAGE_GET_F_CREATE);
Does it affect all sk(s) in the system?  Can it be limited to
the sk that the test is testing?
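
One hypothetical way to narrow it (illustrative only, reusing the includes and
sk_stg_map from the file above; target_tgid is an assumed rodata variable the
test would set before loading):

const volatile int target_tgid;

SEC("iter/task_file")
int fill_socket_owners_filtered(struct bpf_iter__task_file *ctx)
{
	struct task_struct *task = ctx->task;
	struct file *file = ctx->file;
	struct socket *sock;
	int *sock_tgid;

	if (!task || !file || task->tgid != task->pid)
		return 0;

	/* only tag sockets owned by the test process */
	if (target_tgid && task->tgid != target_tgid)
		return 0;

	sock = bpf_sock_from_file(file);
	if (!sock)
		return 0;

	sock_tgid = bpf_sk_storage_get(&sk_stg_map, sock->sk, 0,
				       BPF_SK_STORAGE_GET_F_CREATE);
	if (sock_tgid)
		*sock_tgid = task->tgid;

	return 0;
}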


Re: [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

2020-11-19 Thread Martin KaFai Lau
On Fri, Nov 20, 2020 at 07:09:22AM +0900, Kuniyuki Iwashima wrote:
> From: Martin KaFai Lau 
> Date: Wed, 18 Nov 2020 15:50:17 -0800
> > On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > which is used only by inet_unhash(). If it is not NULL,
> > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > sockets from the closing listener to the selected one.
> > > 
> > > Listening sockets hold incoming connections as a linked list of struct
> > > request_sock in the accept queue, and each request has reference to a full
> > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > > requests from the closing listener's queue and relink them to the head of
> > > the new listener's queue. We do not process each request, so the migration
> > > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > > sockets, we will take special care in the next commit.
> > > 
> > > By default, we select the last element of socks[] as the new listener.
> > > This behaviour is based on how the kernel moves sockets in socks[].
> > > 
> > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > first two by turns. The sockets move in socks[] like below. (See also [1])
> > > 
> > >   socks[0] : A <-.  socks[0] : D  socks[0] : D
> > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > >   socks[2] : C   |  socks[2] : C --'
> > >   socks[3] : D --'
> > > 
> > > Then, if C and D have newer settings than A and B, and each socket has a
> > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > requests evenly to new listeners.
> > I don't think it should emphasize/claim there is a specific way that
> > the kernel-pick here can redistribute the requests evenly.  It depends on
> > how the application closes/listens.  The userspace cannot expect that the
> > ordering of socks[] will behave in a certain way.
> 
> I've expected replacing listeners by generations as a general use case.
> But exactly. Users should not expect the undocumented kernel internal.
> 
> 
> > The primary redistribution policy has to depend on BPF which is the
> > policy defined by the user based on its application logic (e.g. how
> > its binary restart work).  The application (and bpf) knows which one
> > is a dying process and can avoid distributing to it.
> > 
> > The kernel-pick could be an optional fallback but not a must.  If the bpf
> > prog is attached, I would even go further to call bpf to redistribute
> > regardless of the sysctl, so I think the sysctl is not necessary.
> 
> I also think it is just an optional fallback, but to pick out a different
> listener everytime, choosing the moved socket was reasonable. So the even
> redistribution for a specific use case is a side effect of such socket
> selection.
> 
> But, users should decide to use either way:
>   (1) let the kernel select a new listener randomly
>   (2) select a particular listener by eBPF
> 
> I will update the commit message like:
> The kernel selects a new listener randomly, but as the side effect, it can
> redistribute packets evenly for a specific case where an application
> replaces listeners by generations.
Since there is no feedback on the sysctl, maybe something was missed
in the lines.

I don't think this migration logic should depend on a sysctl.
At least not when a bpf prog capable of doing migration is attached;
it is too fragile to ask the user to remember to turn on
the sysctl before attaching the bpf prog.

Is your use case primarily based on a bpf prog doing the pick, or only
on the kernel doing a random pick?

Also, IIUC, this sysctl setting sticks at "*reuse"; there is no way to
change it until all the listening sockets are closed, which is exactly
the service disruption problem this series is trying to solve here.


Re: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.

2020-11-19 Thread Martin KaFai Lau
On Fri, Nov 20, 2020 at 07:17:49AM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau 
> Date:   Wed, 18 Nov 2020 17:49:13 -0800
> > On Tue, Nov 17, 2020 at 06:40:15PM +0900, Kuniyuki Iwashima wrote:
> > > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > accept connections evenly. However, there is a defect in the current
> > > implementation. When a SYN packet is received, the connection is tied to a
> > > listening socket. Accordingly, when the listener is closed, in-flight
> > > requests during the three-way handshake and child sockets in the accept
> > > queue are dropped even if other listeners could accept such connections.
> > > 
> > > This situation can happen when various server management tools restart
> > > server (such as nginx) processes. For instance, when we change nginx
> > > configurations and restart it, it spins up new workers that respect the 
> > > new
> > > configuration and closes all listeners on the old workers, resulting in
> > > in-flight ACK of 3WHS is responded by RST.
> > > 
> > > As a workaround for this issue, we can do connection draining by eBPF:
> > > 
> > >   1. Before closing a listener, stop routing SYN packets to it.
> > >   2. Wait enough time for requests to complete 3WHS.
> > >   3. Accept connections until EAGAIN, then close the listener.
> > > 
> > > Although this approach seems to work well, EAGAIN has nothing to do with
> > > how many requests are still during 3WHS. Thus, we have to know the number
> > It sounds like the application can already drain the established socket
> > by accept()?  To solve the problem that you have,
> > does it mean migrating req_sk (the in-progress 3WHS) is enough?
> 
> Ideally, the application needs to drain only the accepted sockets because
> 3WHS and tying a connection to a listener are just kernel behaviour. Also,
> there are some cases where we want to apply new configurations as soon as
> possible such as replacing TLS certificates.
> 
> It is possible to drain the established sockets by accept(), but the
> sockets in the accept queue have not started application sessions yet. So,
> if we do not drain such sockets (or if the kernel happened to select
> another listener), we can (could) apply the new settings much earlier.
> 
> Moreover, the established sockets may start long-standing connections so
> that we cannot complete draining for a long time and may have to
> force-close them (and they would have longer lifetime if they are migrated
> to a new listener).
> 
> 
> > Applications can already use the bpf prog to do (1) and divert
> > the SYN to the newly started process.
> > 
> > If the application cares about service disruption,
> > it usually needs to drain the fd(s) that it already has and
> > finish serving the pending requests (e.g. https) on them anyway.
> > The time taken to finish those could already be longer than it takes
> > to drain the accept queue or finish off the 3WHS in reasonable time.
> > Or is it that the application you have does not need to drain the fd(s)
> > it already has and can close them immediately?
> 
> In the point of view of service disruption, I agree with you.
> 
> However, I think that there are some situations where we want to apply new
> configurations rather than to drain sockets with old configurations and
> that if the kernel migrates sockets automatically, we can simplify user
> programs.
This configuration-update (/new-TLS-cert...etc) consideration would be useful
to also include in the cover letter.

It sounds like the service that you have is draining the existing
already-accepted fd(s) which are using the old configuration.
Those existing fd(s) could also be long-lived.  Potentially those
existing fd(s) will be much greater in number than the
to-be-accepted fd(s)?

Or did you mean that in some cases it wants to migrate to the new configuration
ASAP (e.g. for security reasons) even if it has to close all the
already-accepted fd(s) which are using the old configuration?

In either case, considering the already-accepted fd(s)
are usually much greater in number, do the to-be-accepted
connections make any difference percentage-wise?


Re: [PATCH bpf-next 1/6] bpf: fix bpf_put_raw_tracepoint()'s use of __module_address()

2020-11-20 Thread Martin KaFai Lau
On Thu, Nov 19, 2020 at 03:22:39PM -0800, Andrii Nakryiko wrote:
> __module_address() needs to be called with preemption disabled or with
> module_mutex taken. preempt_disable() is enough for read-only uses, which is
> what this fix does.
Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next 2/6] libbpf: add internal helper to load BTF data by FD

2020-11-20 Thread Martin KaFai Lau
On Thu, Nov 19, 2020 at 03:22:40PM -0800, Andrii Nakryiko wrote:
[ ... ]

> +int btf__get_from_id(__u32 id, struct btf **btf)
> +{
> + struct btf *res;
> + int btf_fd;
> +
> + *btf = NULL;
> + btf_fd = bpf_btf_get_fd_by_id(id);
> + if (btf_fd < 0)
> + return 0;
It should return an error.

> +
> + res = btf_get_from_fd(btf_fd, NULL);
> + close(btf_fd);
> + if (IS_ERR(res))
> + return PTR_ERR(res);
> +
> + *btf = res;
> + return 0;
>  }
>  
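
That is, something along these lines (a sketch; whether to return -errno
directly or map it through libbpf's error convention is left open here):

	btf_fd = bpf_btf_get_fd_by_id(id);
	if (btf_fd < 0)
		return -errno;	/* propagate the failure instead of returning 0 */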


Re: [PATCH bpf-next 3/6] libbpf: refactor CO-RE relocs to not assume a single BTF object

2020-11-20 Thread Martin KaFai Lau
On Thu, Nov 19, 2020 at 03:22:41PM -0800, Andrii Nakryiko wrote:
> Refactor CO-RE relocation candidate search to not expect a single BTF, rather
> return all candidate types with their corresponding BTF objects. This will
> allow to extend CO-RE relocations to accommodate kernel module BTFs.
Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next 4/6] libbpf: add kernel module BTF support for CO-RE relocations

2020-11-20 Thread Martin KaFai Lau
On Thu, Nov 19, 2020 at 03:22:42PM -0800, Andrii Nakryiko wrote:
[ ... ]

> +static int load_module_btfs(struct bpf_object *obj)
> +{
> + struct bpf_btf_info info;
> + struct module_btf *mod_btf;
> + struct btf *btf;
> + char name[64];
> + __u32 id, len;
> + int err, fd;
> +
> + if (obj->btf_modules_loaded)
> + return 0;
> +
> + /* don't do this again, even if we find no module BTFs */
> + obj->btf_modules_loaded = true;
> +
> + /* kernel too old to support module BTFs */
> + if (!kernel_supports(FEAT_MODULE_BTF))
> + return 0;
> +
> + while (true) {
> + err = bpf_btf_get_next_id(id, &id);
> + if (err && errno == ENOENT)
> + return 0;
> + if (err) {
> + err = -errno;
> + pr_warn("failed to iterate BTF objects: %d\n", err);
> + return err;
> + }
> +
> + fd = bpf_btf_get_fd_by_id(id);
> + if (fd < 0) {
> + if (errno == ENOENT)
> + continue; /* expected race: BTF was unloaded */
> + err = -errno;
> + pr_warn("failed to get BTF object #%d FD: %d\n", id, 
> err);
> + return err;
> + }
> +
> + len = sizeof(info);
> + memset(&info, 0, sizeof(info));
> + info.name = ptr_to_u64(name);
> + info.name_len = sizeof(name);
> +
> + err = bpf_obj_get_info_by_fd(fd, &info, &len);
> + if (err) {
> + err = -errno;
> + pr_warn("failed to get BTF object #%d info: %d\n", id, 
> err);

close(fd);

> + return err;
> + }
> +
> + /* ignore non-module BTFs */
> + if (!info.kernel_btf || strcmp(name, "vmlinux") == 0) {
> + close(fd);
> + continue;
> + }
> +

[ ... ]

> @@ -8656,9 +8815,6 @@ static inline int __find_vmlinux_btf_id(struct btf 
> *btf, const char *name,
>   else
>   err = btf__find_by_name_kind(btf, name, BTF_KIND_FUNC);
>  
> - if (err <= 0)
> - pr_warn("%s is not found in vmlinux BTF\n", name);
> -
>   return err;
>  }
>  
> @@ -8675,6 +8831,9 @@ int libbpf_find_vmlinux_btf_id(const char *name,
>   }
>  
>   err = __find_vmlinux_btf_id(btf, name, attach_type);
> + if (err <= 0)
> + pr_warn("%s is not found in vmlinux BTF\n", name);
> +
Please explain this move in the commit message.

>   btf__free(btf);
>   return err;
>  }
> -- 
> 2.24.1
> 


Re: [PATCH bpf-next 5/6] selftests/bpf: add bpf_sidecar kernel module for testing

2020-11-20 Thread Martin KaFai Lau
On Thu, Nov 19, 2020 at 03:22:43PM -0800, Andrii Nakryiko wrote:
> Add bpf_sidecar module, which is conceptually out-of-tree module and provides
> ways for selftests/bpf to test various kernel module-related functionality:
> raw tracepoint, fentry/fexit/fmod_ret, etc. This module will be auto-loaded by
> test_progs test runner and expected by some of selftests to be present and
> loaded.
Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next 6/6] selftests/bpf: add CO-RE relocs selftest relying on kernel module BTF

2020-11-20 Thread Martin KaFai Lau
On Thu, Nov 19, 2020 at 03:22:44PM -0800, Andrii Nakryiko wrote:
> Add a self-tests validating libbpf is able to perform CO-RE relocations
> against the type defined in kernel module BTF.
Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next v2 1/7] samples: bpf: refactor hbm program with libbpf

2020-11-20 Thread Martin KaFai Lau
On Thu, Nov 19, 2020 at 03:06:11PM +, Daniel T. Lee wrote:
[ ... ]

>  static int run_bpf_prog(char *prog, int cg_id)
>  {
> - int map_fd;
> - int rc = 0;
> + struct hbm_queue_stats qstats = {0};
> + char cg_dir[100], cg_pin_path[100];
> + struct bpf_link *link = NULL;
>   int key = 0;
>   int cg1 = 0;
> - int type = BPF_CGROUP_INET_EGRESS;
> - char cg_dir[100];
> - struct hbm_queue_stats qstats = {0};
> + int rc = 0;
>  
>   sprintf(cg_dir, "/hbm%d", cg_id);
> - map_fd = prog_load(prog);
> - if (map_fd  == -1)
> - return 1;
> + rc = prog_load(prog);
> + if (rc != 0)
> + return rc;
>  
>   if (setup_cgroup_environment()) {
>   printf("ERROR: setting cgroup environment\n");
> @@ -190,16 +183,25 @@ static int run_bpf_prog(char *prog, int cg_id)
>   qstats.stats = stats_flag ? 1 : 0;
>   qstats.loopback = loopback_flag ? 1 : 0;
>   qstats.no_cn = no_cn_flag ? 1 : 0;
> - if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY)) {
> + if (bpf_map_update_elem(queue_stats_fd, &key, &qstats, BPF_ANY)) {
>   printf("ERROR: Could not update map element\n");
>   goto err;
>   }
>  
>   if (!outFlag)
> - type = BPF_CGROUP_INET_INGRESS;
> - if (bpf_prog_attach(bpfprog_fd, cg1, type, 0)) {
> - printf("ERROR: bpf_prog_attach fails!\n");
> - log_err("Attaching prog");
> + bpf_program__set_expected_attach_type(bpf_prog, 
> BPF_CGROUP_INET_INGRESS);
> +
> + link = bpf_program__attach_cgroup(bpf_prog, cg1);
> + if (libbpf_get_error(link)) {
> + fprintf(stderr, "ERROR: bpf_program__attach_cgroup failed\n");
> + link = NULL;
Again, this is not needed.  bpf_link__destroy() can
handle both NULL and error pointer.  Please take a look
at the bpf_link__destroy() in libbpf.c

> + goto err;
> + }
> +
> + sprintf(cg_pin_path, "/sys/fs/bpf/hbm%d", cg_id);
> + rc = bpf_link__pin(link, cg_pin_path);
> + if (rc < 0) {
> + printf("ERROR: bpf_link__pin failed: %d\n", rc);
>   goto err;
>   }
>  
> @@ -213,7 +215,7 @@ static int run_bpf_prog(char *prog, int cg_id)
>  #define DELTA_RATE_CHECK 1   /* in us */
>  #define RATE_THRESHOLD 95/* 9.5 Gbps */
>  
> - bpf_map_lookup_elem(map_fd, &key, &qstats);
> + bpf_map_lookup_elem(queue_stats_fd, &key, &qstats);
>   if (gettimeofday(&t0, NULL) < 0)
>   do_error("gettimeofday failed", true);
>   t_last = t0;
> @@ -242,7 +244,7 @@ static int run_bpf_prog(char *prog, int cg_id)
>   fclose(fin);
>   printf("  new_eth_tx_bytes:%llu\n",
>  new_eth_tx_bytes);
> - bpf_map_lookup_elem(map_fd, &key, &qstats);
> + bpf_map_lookup_elem(queue_stats_fd, &key, &qstats);
>   new_cg_tx_bytes = qstats.bytes_total;
>   delta_bytes = new_eth_tx_bytes - last_eth_tx_bytes;
>   last_eth_tx_bytes = new_eth_tx_bytes;
> @@ -289,14 +291,14 @@ static int run_bpf_prog(char *prog, int cg_id)
>   rate = minRate;
>   qstats.rate = rate;
>   }
> - if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY))
> + if (bpf_map_update_elem(queue_stats_fd, &key, &qstats, 
> BPF_ANY))
>   do_error("update map element fails", false);
>   }
>   } else {
>   sleep(dur);
>   }
>   // Get stats!
> - if (stats_flag && bpf_map_lookup_elem(map_fd, &key, &qstats)) {
> + if (stats_flag && bpf_map_lookup_elem(queue_stats_fd, &key, &qstats)) {
>   char fname[100];
>   FILE *fout;
>  
> @@ -398,10 +400,10 @@ static int run_bpf_prog(char *prog, int cg_id)
>  err:
>   rc = 1;
>  
> - if (cg1)
> - close(cg1);
> + bpf_link__destroy(link);
> + close(cg1);
>   cleanup_cgroup_environment();
> -
> + bpf_object__close(obj);
The bpf_* cleanup condition still looks wrong.

I can understand why it does not want to cleanup_cgroup_environment()
on the success case because the sh script may want to run test under this
cgroup.

However, the bpf_link__destroy(), bpf_object__close(), and
even close(cg1) should be done in both success and error
cases.

The cg1 test still looks wrong also.  The cg1 should
be init to -1 and then test for "if (cg1 == -1)".


Re: [PATCH bpf-next v2 1/7] samples: bpf: refactor hbm program with libbpf

2020-11-20 Thread Martin KaFai Lau
On Fri, Nov 20, 2020 at 06:34:05PM -0800, Martin KaFai Lau wrote:
> On Thu, Nov 19, 2020 at 03:06:11PM +, Daniel T. Lee wrote:
> [ ... ]
> 
> >  static int run_bpf_prog(char *prog, int cg_id)
> >  {
> > -   int map_fd;
> > -   int rc = 0;
> > +   struct hbm_queue_stats qstats = {0};
> > +   char cg_dir[100], cg_pin_path[100];
> > +   struct bpf_link *link = NULL;
> > int key = 0;
> > int cg1 = 0;
> > -   int type = BPF_CGROUP_INET_EGRESS;
> > -   char cg_dir[100];
> > -   struct hbm_queue_stats qstats = {0};
> > +   int rc = 0;
> >  
> > sprintf(cg_dir, "/hbm%d", cg_id);
> > -   map_fd = prog_load(prog);
> > -   if (map_fd  == -1)
> > -   return 1;
> > +   rc = prog_load(prog);
> > +   if (rc != 0)
> > +   return rc;
> >  
> > if (setup_cgroup_environment()) {
> > printf("ERROR: setting cgroup environment\n");
> > @@ -190,16 +183,25 @@ static int run_bpf_prog(char *prog, int cg_id)
> > qstats.stats = stats_flag ? 1 : 0;
> > qstats.loopback = loopback_flag ? 1 : 0;
> > qstats.no_cn = no_cn_flag ? 1 : 0;
> > -   if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY)) {
> > +   if (bpf_map_update_elem(queue_stats_fd, &key, &qstats, BPF_ANY)) {
> > printf("ERROR: Could not update map element\n");
> > goto err;
> > }
> >  
> > if (!outFlag)
> > -   type = BPF_CGROUP_INET_INGRESS;
> > -   if (bpf_prog_attach(bpfprog_fd, cg1, type, 0)) {
> > -   printf("ERROR: bpf_prog_attach fails!\n");
> > -   log_err("Attaching prog");
> > +   bpf_program__set_expected_attach_type(bpf_prog, 
> > BPF_CGROUP_INET_INGRESS);
> > +
> > +   link = bpf_program__attach_cgroup(bpf_prog, cg1);
> > +   if (libbpf_get_error(link)) {
> > +   fprintf(stderr, "ERROR: bpf_program__attach_cgroup failed\n");
> > +   link = NULL;
> Again, this is not needed.  bpf_link__destroy() can
> handle both NULL and error pointer.  Please take a look
> at the bpf_link__destroy() in libbpf.c
> 
> > +   goto err;
> > +   }
> > +
> > +   sprintf(cg_pin_path, "/sys/fs/bpf/hbm%d", cg_id);
> > +   rc = bpf_link__pin(link, cg_pin_path);
> > +   if (rc < 0) {
> > +   printf("ERROR: bpf_link__pin failed: %d\n", rc);
> > goto err;
> > }
> >  
> > @@ -213,7 +215,7 @@ static int run_bpf_prog(char *prog, int cg_id)
> >  #define DELTA_RATE_CHECK 1 /* in us */
> >  #define RATE_THRESHOLD 95  /* 9.5 Gbps */
> >  
> > -   bpf_map_lookup_elem(map_fd, &key, &qstats);
> > +   bpf_map_lookup_elem(queue_stats_fd, &key, &qstats);
> > if (gettimeofday(&t0, NULL) < 0)
> > do_error("gettimeofday failed", true);
> > t_last = t0;
> > @@ -242,7 +244,7 @@ static int run_bpf_prog(char *prog, int cg_id)
> > fclose(fin);
> > printf("  new_eth_tx_bytes:%llu\n",
> >new_eth_tx_bytes);
> > -   bpf_map_lookup_elem(map_fd, &key, &qstats);
> > +   bpf_map_lookup_elem(queue_stats_fd, &key, &qstats);
> > new_cg_tx_bytes = qstats.bytes_total;
> > delta_bytes = new_eth_tx_bytes - last_eth_tx_bytes;
> > last_eth_tx_bytes = new_eth_tx_bytes;
> > @@ -289,14 +291,14 @@ static int run_bpf_prog(char *prog, int cg_id)
> > rate = minRate;
> > qstats.rate = rate;
> > }
> > -   if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY))
> > +   if (bpf_map_update_elem(queue_stats_fd, &key, &qstats, 
> > BPF_ANY))
> > do_error("update map element fails", false);
> > }
> > } else {
> > sleep(dur);
> > }
> > // Get stats!
> > -   if (stats_flag && bpf_map_lookup_elem(map_fd, &key, &qstats)) {
> > +   if (stats_flag && bpf_map_lookup_elem(queue_stats_fd, &key, &qstats)) {
> > char fname[100];
> > FILE *fout;
> >  
> > @@ -398,10 +400,10 @@ static int run_bpf_prog(char *prog, int cg_id)
> >  err:
> > rc = 1;
> >  
> > -   if (cg1)
> > -   close(cg1);
> > +   bpf_link__destroy(link);
> > +   close(cg1);
> > cleanup_cgroup_environment();
> > -
> > +   bpf_object__close(obj);
> The bpf_* cleanup condition still looks wrong.
> 
> I can understand why it does not want to cleanup_cgroup_environment()
> on the success case because the sh script may want to run test under this
> cgroup.
> 
> However, the bpf_link__destroy(), bpf_object__close(), and
> even close(cg1) should be done in both success and error
> cases.
> 
> The cg1 test still looks wrong also.  The cg1 should
> be init to -1 and then test for "if (cg1 == -1)".
oops.  I meant cg1 != -1 .
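
Concretely, the cleanup shape being asked for is roughly the following (a
sketch only, not the actual patch): one teardown path used by both the success
return and the err label, with the cgroup left in place only when the run
succeeded:

static int run_cleanup(struct bpf_link *link, struct bpf_object *obj,
		       int cg1, int rc)
{
	bpf_link__destroy(link);	/* handles NULL and error pointers */
	bpf_object__close(obj);
	if (cg1 != -1)			/* cg1 initialized to -1, not 0 */
		close(cg1);
	if (rc)				/* keep the cgroup on success for the sh script */
		cleanup_cgroup_environment();
	return rc;
}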


Re: [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

2020-11-22 Thread Martin KaFai Lau
On Sat, Nov 21, 2020 at 07:13:22PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau 
> Date:   Thu, 19 Nov 2020 17:53:46 -0800
> > On Fri, Nov 20, 2020 at 07:09:22AM +0900, Kuniyuki Iwashima wrote:
> > > From: Martin KaFai Lau 
> > > Date: Wed, 18 Nov 2020 15:50:17 -0800
> > > > On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > > > > This patch lets reuseport_detach_sock() return a pointer of struct 
> > > > > sock,
> > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > sockets from the closing listener to the selected one.
> > > > > 
> > > > > Listening sockets hold incoming connections as a linked list of struct
> > > > > request_sock in the accept queue, and each request has reference to a 
> > > > > full
> > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink 
> > > > > the
> > > > > requests from the closing listener's queue and relink them to the 
> > > > > head of
> > > > > the new listener's queue. We do not process each request, so the 
> > > > > migration
> > > > > completes in O(1) time complexity. However, in the case of 
> > > > > TCP_SYN_RECV
> > > > > sockets, we will take special care in the next commit.
> > > > > 
> > > > > By default, we select the last element of socks[] as the new listener.
> > > > > This behaviour is based on how the kernel moves sockets in socks[].
> > > > > 
> > > > > For example, we call listen() for four sockets (A, B, C, D), and 
> > > > > close the
> > > > > first two by turns. The sockets move in socks[] like below. (See also 
> > > > > [1])
> > > > > 
> > > > >   socks[0] : A <-.  socks[0] : D  socks[0] : D
> > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > >   socks[2] : C   |  socks[2] : C --'
> > > > >   socks[3] : D --'
> > > > > 
> > > > > Then, if C and D have newer settings than A and B, and each socket 
> > > > > has a
> > > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > > requests evenly to new listeners.
> > > > I don't think it should emphasize/claim there is a specific way that
> > > > the kernel-pick here can redistribute the requests evenly.  It depends 
> > > > on
> > > > how the application closes/listens.  The userspace cannot expect that the
> > > > ordering of socks[] will behave in a certain way.
> > > 
> > > I've expected replacing listeners by generations as a general use case.
> > > But exactly. Users should not expect the undocumented kernel internal.
> > > 
> > > 
> > > > The primary redistribution policy has to depend on BPF which is the
> > > > policy defined by the user based on its application logic (e.g. how
> > > > its binary restart work).  The application (and bpf) knows which one
> > > > is a dying process and can avoid distributing to it.
> > > > 
> > > > The kernel-pick could be an optional fallback but not a must.  If the 
> > > > bpf
> > > > prog is attached, I would even go further to call bpf to redistribute
> > > > regardless of the sysctl, so I think the sysctl is not necessary.
> > > 
> > > I also think it is just an optional fallback, but to pick out a different
> > > listener everytime, choosing the moved socket was reasonable. So the even
> > > redistribution for a specific use case is a side effect of such socket
> > > selection.
> > > 
> > > But, users should decide to use either way:
> > >   (1) let the kernel select a new listener randomly
> > >   (2) select a particular listener by eBPF
> > > 
> > > I will update the commit message like:
> > > The kernel selects a new listener randomly, but as the side effect, it can
> > > redistribute packets evenly for a specific case where an application
> > > replaces listeners by generations.
> > Since there is no feedback on the sysctl, maybe something was missed
> > in the lines.
> 
> I'm sorry, I have missed this point while thinking about each reply...
> 
> 
> > I don

Re: [PATCH v3 bpf-next 2/5] bpf: assign ID to vmlinux BTF and return extra info for BTF in GET_OBJ_INFO

2020-11-09 Thread Martin KaFai Lau
On Mon, Nov 09, 2020 at 01:00:21PM -0800, Andrii Nakryiko wrote:
> Allocate ID for vmlinux BTF. This makes it visible when iterating over all BTF
> objects in the system. To allow distinguishing vmlinux BTF (and later kernel
> module BTF) from user-provided BTFs, expose extra kernel_btf flag, as well as
> BTF name ("vmlinux" for vmlinux BTF, will equal to module's name for module
> BTF).  We might want to later allow specifying BTF name for user-provided BTFs
> as well, if that makes sense. But currently this is reserved only for
> in-kernel BTFs.
> 
> Having in-kernel BTFs exposed IDs will allow to extend BPF APIs that require
> in-kernel BTF type with ability to specify BTF types from kernel modules, not
> just vmlinux BTF. This will be implemented in a follow up patch set for
> fentry/fexit/fmod_ret/lsm/etc.
> 
> Signed-off-by: Andrii Nakryiko 
> ---
>  include/uapi/linux/bpf.h   |  3 +++
>  kernel/bpf/btf.c   | 39 --
>  tools/include/uapi/linux/bpf.h |  3 +++
>  3 files changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 9879d6793e90..162999b12790 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4466,6 +4466,9 @@ struct bpf_btf_info {
>   __aligned_u64 btf;
>   __u32 btf_size;
>   __u32 id;
> + __aligned_u64 name;
> + __u32 name_len;
> + __u32 kernel_btf;
>  } __attribute__((aligned(8)));
>  
>  struct bpf_link_info {
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 894ee33f4c84..663c3fb4e614 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -215,6 +215,8 @@ struct btf {
>   struct btf *base_btf;
>   u32 start_id; /* first type ID in this BTF (0 for base BTF) */
>   u32 start_str_off; /* first string offset (0 for base BTF) */
> + char name[MODULE_NAME_LEN];
> + bool kernel_btf;
>  };
>  
>  enum verifier_phase {
> @@ -4430,6 +4432,8 @@ struct btf *btf_parse_vmlinux(void)
>  
>   btf->data = __start_BTF;
>   btf->data_size = __stop_BTF - __start_BTF;
> + btf->kernel_btf = true;
> + snprintf(btf->name, sizeof(btf->name), "vmlinux");
>  
>   err = btf_parse_hdr(env);
>   if (err)
> @@ -4455,8 +4459,13 @@ struct btf *btf_parse_vmlinux(void)
>  
>   bpf_struct_ops_init(btf, log);
>  
> - btf_verifier_env_free(env);
>   refcount_set(&btf->refcnt, 1);
> +
> + err = btf_alloc_id(btf);
> + if (err)
> + goto errout;
> +
> + btf_verifier_env_free(env);
>   return btf;
>  
>  errout:
> @@ -5554,7 +5563,8 @@ int btf_get_info_by_fd(const struct btf *btf,
>   struct bpf_btf_info info;
>   u32 info_copy, btf_copy;
>   void __user *ubtf;
> - u32 uinfo_len;
> + char __user *uname;
> + u32 uinfo_len, uname_len, name_len;
>  
>   uinfo = u64_to_user_ptr(attr->info.info);
>   uinfo_len = attr->info.info_len;
> @@ -5571,6 +5581,31 @@ int btf_get_info_by_fd(const struct btf *btf,
>   return -EFAULT;
>   info.btf_size = btf->data_size;
>  
> + info.kernel_btf = btf->kernel_btf;
> +
> + uname = u64_to_user_ptr(info.name);
> + uname_len = info.name_len;
> + if (!uname ^ !uname_len)
> + return -EINVAL;
> +
> + name_len = strlen(btf->name);
> + info.name_len = name_len;
> +
> + if (uname) {
> + if (uname_len >= name_len + 1) {
> + if (copy_to_user(uname, btf->name, name_len + 1))
> + return -EFAULT;
> + } else {
> + char zero = '\0';
> +
> + if (copy_to_user(uname, btf->name, uname_len - 1))
> + return -EFAULT;
> + if (put_user(zero, uname + uname_len - 1))
> + return -EFAULT;
> + return -ENOSPC;
It should still do copy_to_user() even if it will return -ENOSPC.

> + }
> + }
> +
>   if (copy_to_user(uinfo, &info, info_copy) ||
>   put_user(info_copy, &uattr->info.info_len))
>   return -EFAULT;
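
In other words, roughly this shape (a sketch of the suggestion, not the final
patch): remember -ENOSPC but still fall through to the final copy so userspace
learns the required name_len:

	int ret = 0;

	if (uname) {
		if (uname_len >= name_len + 1) {
			if (copy_to_user(uname, btf->name, name_len + 1))
				return -EFAULT;
		} else {
			char zero = '\0';

			if (copy_to_user(uname, btf->name, uname_len - 1))
				return -EFAULT;
			if (put_user(zero, uname + uname_len - 1))
				return -EFAULT;
			ret = -ENOSPC;	/* report truncation, but keep going */
		}
	}

	if (copy_to_user(uinfo, &info, info_copy) ||
	    put_user(info_copy, &uattr->info.info_len))
		return -EFAULT;

	return ret;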


Re: [PATCHv2 net 1/2] selftest/bpf: add missed ip6ip6 test back

2020-11-09 Thread Martin KaFai Lau
On Mon, Nov 09, 2020 at 11:00:15AM +0800, Hangbin Liu wrote:
> On Fri, Nov 06, 2020 at 06:15:44PM -0800, Martin KaFai Lau wrote:
> > > - if (iph->nexthdr == 58 /* NEXTHDR_ICMP */) {
> > Same here. Can this check be kept?
> 
> Hi Martin,
> 
> I'm OK to keep the checking, then what about _ipip6_set_tunnel()? It also
> doesn't have the ICMP checking.
It should.  Otherwise, what is the point of testing
"data + sizeof(*iph) > data_end" without checking anything from iph?


Re: [PATCHv3 bpf 0/2] Remove unused test_ipip.sh test and add missed ip6ip6 test

2020-11-10 Thread Martin KaFai Lau
On Tue, Nov 10, 2020 at 09:50:11AM +0800, Hangbin Liu wrote:
> In comment 173ca26e9b51 ("samples/bpf: add comprehensive ipip, ipip6,
> ip6ip6 test") we added some bpf tunnel tests. In commit 933a741e3b82
> ("selftests/bpf: bpf tunnel test.") when we moved it to the current
> folder, we missed some points:
> 
> 1. ip6ip6 test is not added
> 2. forgot to remove test_ipip.sh in sample folder
> 3. TCP test code is not removed in test_tunnel_kern.c
> 
> In this patch set I add back ip6ip6 test and remove unused code. I'm not sure
> if this should be net or net-next, so just set to net.
> 
> Here is the test result:
> ```
> Testing IP6IP6 tunnel...
> PING ::11(::11) 56 data bytes
> 
> --- ::11 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 63ms
> rtt min/avg/max/mdev = 0.014/1028.308/2060.906/841.361 ms, pipe 2
> PING 1::11(1::11) 56 data bytes
> 
> --- 1::11 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 48ms
> rtt min/avg/max/mdev = 0.026/0.029/0.036/0.006 ms
> PING 1::22(1::22) 56 data bytes
> 
> --- 1::22 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 47ms
> rtt min/avg/max/mdev = 0.030/0.048/0.067/0.016 ms
> PASS: ip6ip6tnl
> ```
> 
> v3:
> Add back ICMP check as Martin suggested.
> 
> v2: Keep ip6ip6 section in test_tunnel_kern.c.
This should be for bpf-next.

Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next 2/3] bpf: Allow using bpf_sk_storage in FENTRY/FEXIT/RAW_TP

2020-11-10 Thread Martin KaFai Lau
On Tue, Nov 10, 2020 at 11:01:12PM +0100, KP Singh wrote:
> On Mon, Nov 9, 2020 at 9:32 PM John Fastabend  
> wrote:
> >
> > Andrii Nakryiko wrote:
> > > On Fri, Nov 6, 2020 at 5:52 PM Martin KaFai Lau  wrote:
> > > >
> > > > On Fri, Nov 06, 2020 at 05:14:14PM -0800, Andrii Nakryiko wrote:
> > > > > On Fri, Nov 6, 2020 at 2:08 PM Martin KaFai Lau  wrote:
> > > > > >
> > > > > > This patch enables the FENTRY/FEXIT/RAW_TP tracing program to use
> > > > > > the bpf_sk_storage_(get|delete) helper, so those tracing programs
> > > > > > can access the sk's bpf_local_storage and the later selftest
> > > > > > will show some examples.
> > > > > >
> > > > > > The bpf_sk_storage is currently used in bpf-tcp-cc, tc,
> > > > > > cg sockops...etc which is running either in softirq or
> > > > > > task context.
> > > > > >
> > > > > > This patch adds bpf_sk_storage_get_tracing_proto and
> > > > > > bpf_sk_storage_delete_tracing_proto.  They will check
> > > > > > in runtime that the helpers can only be called when serving
> > > > > > softirq or running in a task context.  That should enable
> > > > > > most common tracing use cases on sk.
> > > > > >
> > > > > > During the load time, the new tracing_allowed() function
> > > > > > will ensure the tracing prog using the bpf_sk_storage_(get|delete)
> > > > > > helper is not tracing any *sk_storage*() function itself.
> > > > > > The sk is passed as "void *" when calling into bpf_local_storage.
> > > > > >
> > > > > > Signed-off-by: Martin KaFai Lau 
> > > > > > ---
> > > > > >  include/net/bpf_sk_storage.h |  2 +
> > > > > >  kernel/trace/bpf_trace.c |  5 +++
> > > > > >  net/core/bpf_sk_storage.c| 73 
> > > > > > 
> > > > > >  3 files changed, 80 insertions(+)
> > > > > >
> > > > >
> > > > > [...]
> > > > >
> > > > > > +   switch (prog->expected_attach_type) {
> > > > > > +   case BPF_TRACE_RAW_TP:
> > > > > > +   /* bpf_sk_storage has no trace point */
> > > > > > +   return true;
> > > > > > +   case BPF_TRACE_FENTRY:
> > > > > > +   case BPF_TRACE_FEXIT:
> > > > > > +   btf_vmlinux = bpf_get_btf_vmlinux();
> > > > > > +   btf_id = prog->aux->attach_btf_id;
> > > > > > +   t = btf_type_by_id(btf_vmlinux, btf_id);
> > > > > > +   tname = btf_name_by_offset(btf_vmlinux, 
> > > > > > t->name_off);
> > > > > > +   return !strstr(tname, "sk_storage");
> > > > >
> > > > > I'm always feeling uneasy about substring checks... Also, KP just
> > > > > fixed the issue with string-based checks for LSM. Can we use a
> > > > > BTF_ID_SET of blacklisted functions instead?
> > > > KP's one is different.  It accidentally whitelisted more than it should.
> > > >
> > > > It is a blacklist here.  It is actually cleaner and safer to blacklist
> > > > all functions with "sk_storage", and being too pessimistic is fine here.
> > >
> > > Fine for whom? Prefix check would be half-bad, but substring check is
> > > horrible. Suddenly "task_storage" (and anything related) would be also
> > > blacklisted. Let's do a prefix check at least.
> > >
> >
> > Agree, prefix check sounds like a good idea. But, just doing a quick
> > grep seems like it will need at least bpf_sk_storage and sk_storage to
> > catch everything.
> 
> Is there any reason we are not using BTF ID sets and an allow list similar
> to bpf_d_path helper? (apart from the obvious inconvenience of
> needing to update the set in the kernel)
It is a blacklist here; a small recap from the commit message:

> During the load time, the new tracing_allowed() function
> will ensure the tracing prog using the bpf_sk_storage_(get|delete)
> helper is not tracing any *sk_storage*() function itself.
> The sk is passed as "void *" when calling into bpf_local_storage.

Both BTF_ID and string-based (either prefix/substr) will work.

The intention is to first disallow a tracing program from tracing
any function in bpf_sk_storage.c and also calling the
bpf_sk_storage_(get|delete) helper at the same time.
This blacklist can be revisited later if there is ever
a use case for tracing some of the blacklisted
functions (which I doubt).

To use BTF_ID, it needs to be considered whether the current (and future)
bpf_sk_storage functions can be used in BTF_ID or not:
static, global/external, or inlined.

If BTF_ID is the best way for doing all black/white lists, I don't mind
either.  I could force some to be inlined, and we would need to remember
to revisit the blacklist when the scope of fentry/fexit traceable
functions changes, e.g. when static functions become traceable
later.  Future changes to bpf_sk_storage.c will need to
adjust this list also.
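
For comparison, the BTF_ID-set alternative mentioned above would look roughly
like the following (illustrative only; the set members and placement are
assumptions, and it would need updating whenever the set of traceable
bpf_sk_storage functions changes):

#include <linux/btf_ids.h>

BTF_SET_START(bpf_sk_storage_tracing_deny)
BTF_ID(func, bpf_sk_storage_get)
BTF_ID(func, bpf_sk_storage_delete)
BTF_ID(func, bpf_sk_storage_free)
BTF_SET_END(bpf_sk_storage_tracing_deny)

static bool bpf_sk_storage_tracing_allowed_btfid(const struct bpf_prog *prog)
{
	return !btf_id_set_contains(&bpf_sk_storage_tracing_deny,
				    prog->aux->attach_btf_id);
}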


Re: [PATCH bpf-next 2/3] bpf: Allow using bpf_sk_storage in FENTRY/FEXIT/RAW_TP

2020-11-10 Thread Martin KaFai Lau
On Tue, Nov 10, 2020 at 03:53:13PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 10, 2020 at 3:43 PM Martin KaFai Lau  wrote:
> >
> > On Tue, Nov 10, 2020 at 11:01:12PM +0100, KP Singh wrote:
> > > On Mon, Nov 9, 2020 at 9:32 PM John Fastabend  
> > > wrote:
> > > >
> > > > Andrii Nakryiko wrote:
> > > > > On Fri, Nov 6, 2020 at 5:52 PM Martin KaFai Lau  wrote:
> > > > > >
> > > > > > On Fri, Nov 06, 2020 at 05:14:14PM -0800, Andrii Nakryiko wrote:
> > > > > > > On Fri, Nov 6, 2020 at 2:08 PM Martin KaFai Lau  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > This patch enables the FENTRY/FEXIT/RAW_TP tracing program to 
> > > > > > > > use
> > > > > > > > the bpf_sk_storage_(get|delete) helper, so those tracing 
> > > > > > > > programs
> > > > > > > > can access the sk's bpf_local_storage and the later selftest
> > > > > > > > will show some examples.
> > > > > > > >
> > > > > > > > The bpf_sk_storage is currently used in bpf-tcp-cc, tc,
> > > > > > > > cg sockops...etc which is running either in softirq or
> > > > > > > > task context.
> > > > > > > >
> > > > > > > > This patch adds bpf_sk_storage_get_tracing_proto and
> > > > > > > > bpf_sk_storage_delete_tracing_proto.  They will check
> > > > > > > > in runtime that the helpers can only be called when serving
> > > > > > > > softirq or running in a task context.  That should enable
> > > > > > > > most common tracing use cases on sk.
> > > > > > > >
> > > > > > > > During the load time, the new tracing_allowed() function
> > > > > > > > will ensure the tracing prog using the 
> > > > > > > > bpf_sk_storage_(get|delete)
> > > > > > > > helper is not tracing any *sk_storage*() function itself.
> > > > > > > > The sk is passed as "void *" when calling into 
> > > > > > > > bpf_local_storage.
> > > > > > > >
> > > > > > > > Signed-off-by: Martin KaFai Lau 
> > > > > > > > ---
> > > > > > > >  include/net/bpf_sk_storage.h |  2 +
> > > > > > > >  kernel/trace/bpf_trace.c |  5 +++
> > > > > > > >  net/core/bpf_sk_storage.c| 73 
> > > > > > > > 
> > > > > > > >  3 files changed, 80 insertions(+)
> > > > > > > >
> > > > > > >
> > > > > > > [...]
> > > > > > >
> > > > > > > > +   switch (prog->expected_attach_type) {
> > > > > > > > +   case BPF_TRACE_RAW_TP:
> > > > > > > > +   /* bpf_sk_storage has no trace point */
> > > > > > > > +   return true;
> > > > > > > > +   case BPF_TRACE_FENTRY:
> > > > > > > > +   case BPF_TRACE_FEXIT:
> > > > > > > > +   btf_vmlinux = bpf_get_btf_vmlinux();
> > > > > > > > +   btf_id = prog->aux->attach_btf_id;
> > > > > > > > +   t = btf_type_by_id(btf_vmlinux, btf_id);
> > > > > > > > +   tname = btf_name_by_offset(btf_vmlinux, 
> > > > > > > > t->name_off);
> > > > > > > > +   return !strstr(tname, "sk_storage");
> > > > > > >
> > > > > > > I'm always feeling uneasy about substring checks... Also, KP just
> > > > > > > fixed the issue with string-based checks for LSM. Can we use a
> > > > > > > BTF_ID_SET of blacklisted functions instead?
> > > > > > KP's one is different.  It accidentally whitelisted more than it
> > > > > > should.
> > > > > >
> > > > > > It is a blacklist here.  It is actually cleaner and safer to
> > > > > > blacklist
> > > > > > all functions with "sk_storage", and being too pessimistic is fine here.
> > > > >
> > > > > Fine fo

Re: [PATCH bpf-next 2/3] bpf: Allow using bpf_sk_storage in FENTRY/FEXIT/RAW_TP

2020-11-10 Thread Martin KaFai Lau
On Tue, Nov 10, 2020 at 04:17:06PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 10, 2020 at 4:07 PM Martin KaFai Lau  wrote:
> >
> > On Tue, Nov 10, 2020 at 03:53:13PM -0800, Andrii Nakryiko wrote:
> > > On Tue, Nov 10, 2020 at 3:43 PM Martin KaFai Lau  wrote:
> > > >
> > > > On Tue, Nov 10, 2020 at 11:01:12PM +0100, KP Singh wrote:
> > > > > On Mon, Nov 9, 2020 at 9:32 PM John Fastabend 
> > > > >  wrote:
> > > > > >
> > > > > > Andrii Nakryiko wrote:
> > > > > > > On Fri, Nov 6, 2020 at 5:52 PM Martin KaFai Lau  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Fri, Nov 06, 2020 at 05:14:14PM -0800, Andrii Nakryiko wrote:
> > > > > > > > > On Fri, Nov 6, 2020 at 2:08 PM Martin KaFai Lau 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > This patch enables the FENTRY/FEXIT/RAW_TP tracing program 
> > > > > > > > > > to use
> > > > > > > > > > the bpf_sk_storage_(get|delete) helper, so those tracing 
> > > > > > > > > > programs
> > > > > > > > > > can access the sk's bpf_local_storage and the later selftest
> > > > > > > > > > will show some examples.
> > > > > > > > > >
> > > > > > > > > > The bpf_sk_storage is currently used in bpf-tcp-cc, tc,
> > > > > > > > > > cg sockops...etc which is running either in softirq or
> > > > > > > > > > task context.
> > > > > > > > > >
> > > > > > > > > > This patch adds bpf_sk_storage_get_tracing_proto and
> > > > > > > > > > bpf_sk_storage_delete_tracing_proto.  They will check
> > > > > > > > > > in runtime that the helpers can only be called when serving
> > > > > > > > > > softirq or running in a task context.  That should enable
> > > > > > > > > > most common tracing use cases on sk.
> > > > > > > > > >
> > > > > > > > > > During the load time, the new tracing_allowed() function
> > > > > > > > > > will ensure the tracing prog using the 
> > > > > > > > > > bpf_sk_storage_(get|delete)
> > > > > > > > > > helper is not tracing any *sk_storage*() function itself.
> > > > > > > > > > The sk is passed as "void *" when calling into 
> > > > > > > > > > bpf_local_storage.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Martin KaFai Lau 
> > > > > > > > > > ---
> > > > > > > > > >  include/net/bpf_sk_storage.h |  2 +
> > > > > > > > > >  kernel/trace/bpf_trace.c |  5 +++
> > > > > > > > > >  net/core/bpf_sk_storage.c| 73 
> > > > > > > > > > 
> > > > > > > > > >  3 files changed, 80 insertions(+)
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > [...]
> > > > > > > > >
> > > > > > > > > > +   switch (prog->expected_attach_type) {
> > > > > > > > > > +   case BPF_TRACE_RAW_TP:
> > > > > > > > > > +   /* bpf_sk_storage has no trace point */
> > > > > > > > > > +   return true;
> > > > > > > > > > +   case BPF_TRACE_FENTRY:
> > > > > > > > > > +   case BPF_TRACE_FEXIT:
> > > > > > > > > > +   btf_vmlinux = bpf_get_btf_vmlinux();
> > > > > > > > > > +   btf_id = prog->aux->attach_btf_id;
> > > > > > > > > > +   t = btf_type_by_id(btf_vmlinux, btf_id);
> > > > > > > > > > +   tname = btf_name_by_offset(btf_vmlinux, 
> > > > > > > > > > t->name_off);
> > > > > > > > > > +   return !strstr(tname, "sk_storage");

[PATCH v2 bpf-next 0/4] bpf: Enable bpf_sk_storage for FENTRY/FEXIT/RAW_TP

2020-11-12 Thread Martin KaFai Lau
This set is to allow the FENTRY/FEXIT/RAW_TP tracing program to use
bpf_sk_storage.  The first two patches are a cleanup.  The last patch is
tests.  Patch 3 has the required kernel changes to
enable bpf_sk_storage for FENTRY/FEXIT/RAW_TP.

Please see individual patch for details.

v2:
- Rename some of the function prefix from sk_storage to bpf_sk_storage
- Use prefix check instead of substr check

Martin KaFai Lau (4):
  bpf: Folding omem_charge() into sk_storage_charge()
  bpf: Rename some functions in bpf_sk_storage
  bpf: Allow using bpf_sk_storage in FENTRY/FEXIT/RAW_TP
  bpf: selftest: Use bpf_sk_storage in FENTRY/FEXIT/RAW_TP

 include/net/bpf_sk_storage.h  |   2 +
 kernel/trace/bpf_trace.c  |   5 +
 net/core/bpf_sk_storage.c | 135 +-
 .../bpf/prog_tests/sk_storage_tracing.c   | 135 ++
 .../bpf/progs/test_sk_storage_trace_itself.c  |  29 
 .../bpf/progs/test_sk_storage_tracing.c   |  95 
 6 files changed, 369 insertions(+), 32 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_storage_tracing.c
 create mode 100644 
tools/testing/selftests/bpf/progs/test_sk_storage_trace_itself.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_sk_storage_tracing.c

-- 
2.24.1



[PATCH v2 bpf-next 3/4] bpf: Allow using bpf_sk_storage in FENTRY/FEXIT/RAW_TP

2020-11-12 Thread Martin KaFai Lau
This patch enables the FENTRY/FEXIT/RAW_TP tracing program to use
the bpf_sk_storage_(get|delete) helper, so those tracing programs
can access the sk's bpf_local_storage and the later selftest
will show some examples.

The bpf_sk_storage is currently used in bpf-tcp-cc, tc,
cg sockops...etc which is running either in softirq or
task context.

This patch adds bpf_sk_storage_get_tracing_proto and
bpf_sk_storage_delete_tracing_proto.  They will check
in runtime that the helpers can only be called when serving
softirq or running in a task context.  That should enable
most common tracing use cases on sk.

During the load time, the new tracing_allowed() function
will ensure the tracing prog using the bpf_sk_storage_(get|delete)
helper is not tracing any bpf_sk_storage*() function itself.
The sk is passed as "void *" when calling into bpf_local_storage.

This patch only allows tracing a kernel function.

Acked-by: Song Liu 
Signed-off-by: Martin KaFai Lau 
---
 include/net/bpf_sk_storage.h |  2 +
 kernel/trace/bpf_trace.c |  5 +++
 net/core/bpf_sk_storage.c| 74 
 3 files changed, 81 insertions(+)

diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h
index 3c516dd07caf..0e85713f56df 100644
--- a/include/net/bpf_sk_storage.h
+++ b/include/net/bpf_sk_storage.h
@@ -20,6 +20,8 @@ void bpf_sk_storage_free(struct sock *sk);
 
 extern const struct bpf_func_proto bpf_sk_storage_get_proto;
 extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
+extern const struct bpf_func_proto bpf_sk_storage_get_tracing_proto;
+extern const struct bpf_func_proto bpf_sk_storage_delete_tracing_proto;
 
 struct bpf_local_storage_elem;
 struct bpf_sk_storage_diag;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index e4515b0f62a8..cfce60ad1cb5 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1735,6 +1736,10 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
return &bpf_skc_to_tcp_request_sock_proto;
case BPF_FUNC_skc_to_udp6_sock:
return &bpf_skc_to_udp6_sock_proto;
+   case BPF_FUNC_sk_storage_get:
+   return &bpf_sk_storage_get_tracing_proto;
+   case BPF_FUNC_sk_storage_delete:
+   return &bpf_sk_storage_delete_tracing_proto;
 #endif
case BPF_FUNC_seq_printf:
return prog->expected_attach_type == BPF_TRACE_ITER ?
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index fd416678f236..359908a7d3c1 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -378,6 +379,79 @@ const struct bpf_func_proto bpf_sk_storage_delete_proto = {
.arg2_type  = ARG_PTR_TO_BTF_ID_SOCK_COMMON,
 };
 
+static bool bpf_sk_storage_tracing_allowed(const struct bpf_prog *prog)
+{
+   const struct btf *btf_vmlinux;
+   const struct btf_type *t;
+   const char *tname;
+   u32 btf_id;
+
+   if (prog->aux->dst_prog)
+   return false;
+
+   /* Ensure the tracing program is not tracing
+* any bpf_sk_storage*() function and also
+* use the bpf_sk_storage_(get|delete) helper.
+*/
+   switch (prog->expected_attach_type) {
+   case BPF_TRACE_RAW_TP:
+   /* bpf_sk_storage has no trace point */
+   return true;
+   case BPF_TRACE_FENTRY:
+   case BPF_TRACE_FEXIT:
+   btf_vmlinux = bpf_get_btf_vmlinux();
+   btf_id = prog->aux->attach_btf_id;
+   t = btf_type_by_id(btf_vmlinux, btf_id);
+   tname = btf_name_by_offset(btf_vmlinux, t->name_off);
+   return !!strncmp(tname, "bpf_sk_storage",
+strlen("bpf_sk_storage"));
+   default:
+   return false;
+   }
+
+   return false;
+}
+
+BPF_CALL_4(bpf_sk_storage_get_tracing, struct bpf_map *, map, struct sock *, 
sk,
+  void *, value, u64, flags)
+{
+   if (!in_serving_softirq() && !in_task())
+   return (unsigned long)NULL;
+
+   return (unsigned long)bpf_sk_storage_get(map, sk, value, flags);
+}
+
+BPF_CALL_2(bpf_sk_storage_delete_tracing, struct bpf_map *, map,
+  struct sock *, sk)
+{
+   if (!in_serving_softirq() && !in_task())
+   return -EPERM;
+
+   return bpf_sk_storage_delete(map, sk);
+}
+
+const struct bpf_func_proto bpf_sk_storage_get_tracing_proto = {
+   .func   = bpf_sk_storage_get_tracing,
+   .gpl_only   = false,
+   .ret_type   = RET_PTR_TO_MAP_VALUE_OR_NULL,
+   .arg1_type  = ARG_CONST_MAP_PTR,
+   .arg2_type  = ARG_PTR_TO_BTF_ID
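
As a minimal sketch (not part of this series; the map, struct and program
names are illustrative, and the usual libbpf conventions are assumed), the
kind of fentry program these tracing protos enable looks roughly like this;
patch 4/4 below exercises the same pattern, plus a negative case, as a
selftest:

/* Sketch only: a tracing prog storing per-socket data via
 * bpf_sk_storage_get() from an fentry hook.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct sk_stg {
	__u32 pid;
};

struct {
	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct sk_stg);
} sk_stg_map SEC(".maps");

SEC("fentry/inet_listen")
int BPF_PROG(trace_inet_listen, struct socket *sock, int backlog)
{
	struct sk_stg *stg;

	/* Allowed here: fentry on a kernel function that is not a
	 * bpf_sk_storage*() function, running in task context.
	 */
	stg = bpf_sk_storage_get(&sk_stg_map, sock->sk, 0,
				 BPF_SK_STORAGE_GET_F_CREATE);
	if (!stg)
		return 0;

	stg->pid = bpf_get_current_pid_tgid() >> 32;
	return 0;
}

char _license[] SEC("license") = "GPL";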

[PATCH v2 bpf-next 1/4] bpf: Folding omem_charge() into sk_storage_charge()

2020-11-12 Thread Martin KaFai Lau
sk_storage_charge() is the only user of omem_charge().
This patch simplifies it by folding omem_charge() into
sk_storage_charge().

Acked-by: Song Liu 
Signed-off-by: Martin KaFai Lau 
---
 net/core/bpf_sk_storage.c | 23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index c907f0dc7f87..001eac65e40f 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -15,18 +15,6 @@
 
 DEFINE_BPF_STORAGE_CACHE(sk_cache);
 
-static int omem_charge(struct sock *sk, unsigned int size)
-{
-   /* same check as in sock_kmalloc() */
-   if (size <= sysctl_optmem_max &&
-   atomic_read(&sk->sk_omem_alloc) + size < sysctl_optmem_max) {
-   atomic_add(size, &sk->sk_omem_alloc);
-   return 0;
-   }
-
-   return -ENOMEM;
-}
-
 static struct bpf_local_storage_data *
 sk_storage_lookup(struct sock *sk, struct bpf_map *map, bool cacheit_lockit)
 {
@@ -316,7 +304,16 @@ BPF_CALL_2(bpf_sk_storage_delete, struct bpf_map *, map, 
struct sock *, sk)
 static int sk_storage_charge(struct bpf_local_storage_map *smap,
 void *owner, u32 size)
 {
-   return omem_charge(owner, size);
+   struct sock *sk = (struct sock *)owner;
+
+   /* same check as in sock_kmalloc() */
+   if (size <= sysctl_optmem_max &&
+   atomic_read(&sk->sk_omem_alloc) + size < sysctl_optmem_max) {
+   atomic_add(size, &sk->sk_omem_alloc);
+   return 0;
+   }
+
+   return -ENOMEM;
 }
 
 static void sk_storage_uncharge(struct bpf_local_storage_map *smap,
-- 
2.24.1



[PATCH v2 bpf-next 2/4] bpf: Rename some functions in bpf_sk_storage

2020-11-12 Thread Martin KaFai Lau
Rename some of the functions currently prefixed with sk_storage
to bpf_sk_storage.  That will make the next patch need fewer
prefix checks and also bring bpf_sk_storage.c to more
consistent function naming.

Signed-off-by: Martin KaFai Lau 
---
 net/core/bpf_sk_storage.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index 001eac65e40f..fd416678f236 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -16,7 +16,7 @@
 DEFINE_BPF_STORAGE_CACHE(sk_cache);
 
 static struct bpf_local_storage_data *
-sk_storage_lookup(struct sock *sk, struct bpf_map *map, bool cacheit_lockit)
+bpf_sk_storage_lookup(struct sock *sk, struct bpf_map *map, bool 
cacheit_lockit)
 {
struct bpf_local_storage *sk_storage;
struct bpf_local_storage_map *smap;
@@ -29,11 +29,11 @@ sk_storage_lookup(struct sock *sk, struct bpf_map *map, 
bool cacheit_lockit)
return bpf_local_storage_lookup(sk_storage, smap, cacheit_lockit);
 }
 
-static int sk_storage_delete(struct sock *sk, struct bpf_map *map)
+static int bpf_sk_storage_del(struct sock *sk, struct bpf_map *map)
 {
struct bpf_local_storage_data *sdata;
 
-   sdata = sk_storage_lookup(sk, map, false);
+   sdata = bpf_sk_storage_lookup(sk, map, false);
if (!sdata)
return -ENOENT;
 
@@ -82,7 +82,7 @@ void bpf_sk_storage_free(struct sock *sk)
kfree_rcu(sk_storage, rcu);
 }
 
-static void sk_storage_map_free(struct bpf_map *map)
+static void bpf_sk_storage_map_free(struct bpf_map *map)
 {
struct bpf_local_storage_map *smap;
 
@@ -91,7 +91,7 @@ static void sk_storage_map_free(struct bpf_map *map)
bpf_local_storage_map_free(smap);
 }
 
-static struct bpf_map *sk_storage_map_alloc(union bpf_attr *attr)
+static struct bpf_map *bpf_sk_storage_map_alloc(union bpf_attr *attr)
 {
struct bpf_local_storage_map *smap;
 
@@ -118,7 +118,7 @@ static void *bpf_fd_sk_storage_lookup_elem(struct bpf_map 
*map, void *key)
fd = *(int *)key;
sock = sockfd_lookup(fd, &err);
if (sock) {
-   sdata = sk_storage_lookup(sock->sk, map, true);
+   sdata = bpf_sk_storage_lookup(sock->sk, map, true);
sockfd_put(sock);
return sdata ? sdata->data : NULL;
}
@@ -154,7 +154,7 @@ static int bpf_fd_sk_storage_delete_elem(struct bpf_map 
*map, void *key)
fd = *(int *)key;
sock = sockfd_lookup(fd, &err);
if (sock) {
-   err = sk_storage_delete(sock->sk, map);
+   err = bpf_sk_storage_del(sock->sk, map);
sockfd_put(sock);
return err;
}
@@ -260,7 +260,7 @@ BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, 
struct sock *, sk,
if (!sk || !sk_fullsock(sk) || flags > BPF_SK_STORAGE_GET_F_CREATE)
return (unsigned long)NULL;
 
-   sdata = sk_storage_lookup(sk, map, true);
+   sdata = bpf_sk_storage_lookup(sk, map, true);
if (sdata)
return (unsigned long)sdata->data;
 
@@ -293,7 +293,7 @@ BPF_CALL_2(bpf_sk_storage_delete, struct bpf_map *, map, 
struct sock *, sk)
if (refcount_inc_not_zero(&sk->sk_refcnt)) {
int err;
 
-   err = sk_storage_delete(sk, map);
+   err = bpf_sk_storage_del(sk, map);
sock_put(sk);
return err;
}
@@ -301,8 +301,8 @@ BPF_CALL_2(bpf_sk_storage_delete, struct bpf_map *, map, 
struct sock *, sk)
return -ENOENT;
 }
 
-static int sk_storage_charge(struct bpf_local_storage_map *smap,
-void *owner, u32 size)
+static int bpf_sk_storage_charge(struct bpf_local_storage_map *smap,
+void *owner, u32 size)
 {
struct sock *sk = (struct sock *)owner;
 
@@ -316,8 +316,8 @@ static int sk_storage_charge(struct bpf_local_storage_map 
*smap,
return -ENOMEM;
 }
 
-static void sk_storage_uncharge(struct bpf_local_storage_map *smap,
-   void *owner, u32 size)
+static void bpf_sk_storage_uncharge(struct bpf_local_storage_map *smap,
+   void *owner, u32 size)
 {
struct sock *sk = owner;
 
@@ -325,7 +325,7 @@ static void sk_storage_uncharge(struct 
bpf_local_storage_map *smap,
 }
 
 static struct bpf_local_storage __rcu **
-sk_storage_ptr(void *owner)
+bpf_sk_storage_ptr(void *owner)
 {
struct sock *sk = owner;
 
@@ -336,8 +336,8 @@ static int sk_storage_map_btf_id;
 const struct bpf_map_ops sk_storage_map_ops = {
.map_meta_equal = bpf_map_meta_equal,
.map_alloc_check = bpf_local_storage_map_alloc_check,
-   .map_alloc = sk_storage_map_alloc,
-   .map_free = sk_storage_map_free,
+   .map_alloc = bpf_sk_storage_map_alloc,
+   .map_free = bpf_sk_storage

[PATCH v2 bpf-next 4/4] bpf: selftest: Use bpf_sk_storage in FENTRY/FEXIT/RAW_TP

2020-11-12 Thread Martin KaFai Lau
This patch tests storing the task's related info into the
bpf_sk_storage by fentry/fexit tracing at listen, accept,
and connect.  It also tests the raw_tp at inet_sock_set_state.

A negative test is done by tracing bpf_sk_storage_free()
while using bpf_sk_storage_get() at the same time.  It ensures
this bpf program cannot be loaded.

Signed-off-by: Martin KaFai Lau 
---
 .../bpf/prog_tests/sk_storage_tracing.c   | 135 ++
 .../bpf/progs/test_sk_storage_trace_itself.c  |  29 
 .../bpf/progs/test_sk_storage_tracing.c   |  95 
 3 files changed, 259 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_storage_tracing.c
 create mode 100644 
tools/testing/selftests/bpf/progs/test_sk_storage_trace_itself.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_sk_storage_tracing.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sk_storage_tracing.c 
b/tools/testing/selftests/bpf/prog_tests/sk_storage_tracing.c
new file mode 100644
index ..2b392590e8ca
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sk_storage_tracing.c
@@ -0,0 +1,135 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+
+#include 
+#include 
+#include 
+#include "test_progs.h"
+#include "network_helpers.h"
+#include "test_sk_storage_trace_itself.skel.h"
+#include "test_sk_storage_tracing.skel.h"
+
+#define LO_ADDR6 "::1"
+#define TEST_COMM "test_progs"
+
+struct sk_stg {
+   __u32 pid;
+   __u32 last_notclose_state;
+   char comm[16];
+};
+
+static struct test_sk_storage_tracing *skel;
+static __u32 duration;
+static pid_t my_pid;
+
+static int check_sk_stg(int sk_fd, __u32 expected_state)
+{
+   struct sk_stg sk_stg;
+   int err;
+
+   err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.sk_stg_map), &sk_fd,
+ &sk_stg);
+   if (!ASSERT_OK(err, "map_lookup(sk_stg_map)"))
+   return -1;
+
+   if (!ASSERT_EQ(sk_stg.last_notclose_state, expected_state,
+  "last_notclose_state"))
+   return -1;
+
+   if (!ASSERT_EQ(sk_stg.pid, my_pid, "pid"))
+   return -1;
+
+   if (!ASSERT_STREQ(sk_stg.comm, skel->bss->task_comm, "task_comm"))
+   return -1;
+
+   return 0;
+}
+
+static void do_test(void)
+{
+   int listen_fd = -1, passive_fd = -1, active_fd = -1, value = 1, err;
+   char abyte;
+
+   listen_fd = start_server(AF_INET6, SOCK_STREAM, LO_ADDR6, 0, 0);
+   if (CHECK(listen_fd == -1, "start_server",
+ "listen_fd:%d errno:%d\n", listen_fd, errno))
+   return;
+
+   active_fd = connect_to_fd(listen_fd, 0);
+   if (CHECK(active_fd == -1, "connect_to_fd", "active_fd:%d errno:%d\n",
+ active_fd, errno))
+   goto out;
+
+   err = bpf_map_update_elem(bpf_map__fd(skel->maps.del_sk_stg_map),
+ &active_fd, &value, 0);
+   if (!ASSERT_OK(err, "map_update(del_sk_stg_map)"))
+   goto out;
+
+   passive_fd = accept(listen_fd, NULL, 0);
+   if (CHECK(passive_fd == -1, "accept", "passive_fd:%d errno:%d\n",
+ passive_fd, errno))
+   goto out;
+
+   shutdown(active_fd, SHUT_WR);
+   err = read(passive_fd, &abyte, 1);
+   if (!ASSERT_OK(err, "read(passive_fd)"))
+   goto out;
+
+   shutdown(passive_fd, SHUT_WR);
+   err = read(active_fd, &abyte, 1);
+   if (!ASSERT_OK(err, "read(active_fd)"))
+   goto out;
+
+   err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.del_sk_stg_map),
+ &active_fd, &value);
+   if (!ASSERT_ERR(err, "map_lookup(del_sk_stg_map)"))
+   goto out;
+
+   err = check_sk_stg(listen_fd, BPF_TCP_LISTEN);
+   if (!ASSERT_OK(err, "listen_fd sk_stg"))
+   goto out;
+
+   err = check_sk_stg(active_fd, BPF_TCP_FIN_WAIT2);
+   if (!ASSERT_OK(err, "active_fd sk_stg"))
+   goto out;
+
+   err = check_sk_stg(passive_fd, BPF_TCP_LAST_ACK);
+   ASSERT_OK(err, "passive_fd sk_stg");
+
+out:
+   if (active_fd != -1)
+   close(active_fd);
+   if (passive_fd != -1)
+   close(passive_fd);
+   if (listen_fd != -1)
+   close(listen_fd);
+}
+
+void test_sk_storage_tracing(void)
+{
+   struct test_sk_storage_trace_itself *skel_itself;
+   int err;
+
+   my_pid = getpid();
+
+   skel_itself = test_sk_storage_trace_itself__open_and_load();
+
+   if (!ASSERT_NULL(skel_itself, "test_sk_storage_trace_itself")) {
+   test_sk_storage_tra
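
The negative case described above amounts to a program shaped roughly like
the sketch below (not the actual test_sk_storage_trace_itself.c source; it
assumes the same include/license boilerplate as the sketch after patch 3/4
above).  With the tracing_allowed() check from patch 3/4, loading it is
expected to fail:

/* Sketch only: tracing a bpf_sk_storage*() function while also
 * using the bpf_sk_storage_get() helper -- must be rejected.
 */
struct {
	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, int);
} sk_stg_map SEC(".maps");

SEC("fentry/bpf_sk_storage_free")
int BPF_PROG(trace_bpf_sk_storage_free, struct sock *sk)
{
	int *value;

	value = bpf_sk_storage_get(&sk_stg_map, sk, 0,
				   BPF_SK_STORAGE_GET_F_CREATE);
	if (value)
		*value = 1;
	return 0;
}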

Re: [PATCH v2 bpf-next 3/4] bpf: Allow using bpf_sk_storage in FENTRY/FEXIT/RAW_TP

2020-11-16 Thread Martin KaFai Lau
On Sat, Nov 14, 2020 at 05:17:20PM -0800, Jakub Kicinski wrote:
> On Thu, 12 Nov 2020 13:13:13 -0800 Martin KaFai Lau wrote:
> > This patch adds bpf_sk_storage_get_tracing_proto and
> > bpf_sk_storage_delete_tracing_proto.  They will check
> > in runtime that the helpers can only be called when serving
> > softirq or running in a task context.  That should enable
> > most common tracing use cases on sk.
> 
> > +   if (!in_serving_softirq() && !in_task())
> 
> This is a curious combination of checks. Would you mind indulging me
> with an explanation?
The current lock usage in bpf_local_storage.c is only expected to
run in either of these contexts.


Re: [PATCH v2 bpf-next 3/4] bpf: Allow using bpf_sk_storage in FENTRY/FEXIT/RAW_TP

2020-11-16 Thread Martin KaFai Lau
On Mon, Nov 16, 2020 at 10:00:04AM -0800, Jakub Kicinski wrote:
> On Mon, 16 Nov 2020 09:37:34 -0800 Martin KaFai Lau wrote:
> > On Sat, Nov 14, 2020 at 05:17:20PM -0800, Jakub Kicinski wrote:
> > > On Thu, 12 Nov 2020 13:13:13 -0800 Martin KaFai Lau wrote:  
> > > > This patch adds bpf_sk_storage_get_tracing_proto and
> > > > bpf_sk_storage_delete_tracing_proto.  They will check
> > > > in runtime that the helpers can only be called when serving
> > > > softirq or running in a task context.  That should enable
> > > > most common tracing use cases on sk.  
> > >   
> > > > +   if (!in_serving_softirq() && !in_task())  
> > > 
> > > This is a curious combination of checks. Would you mind indulging me
> > > with an explanation?  
> > The current lock usage in bpf_local_storage.c is only expected to
> > run in either of these contexts.
> 
> :)
> 
> Locks that can run in any context but preempt disabled or softirq
> disabled?
Not exactly. e.g. running from irq won't work.

> 
> Let me cut to the chase. Are you sure you didn't mean to check
> if (irq_count()) ?
so, no.

>From preempt.h:

/*
 * ...
 * in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
 * ...
 */
#define in_interrupt()  (irq_count())


Re: [PATCH v2 bpf-next 3/4] bpf: Allow using bpf_sk_storage in FENTRY/FEXIT/RAW_TP

2020-11-16 Thread Martin KaFai Lau
On Mon, Nov 16, 2020 at 10:43:40AM -0800, Jakub Kicinski wrote:
> On Mon, 16 Nov 2020 10:37:49 -0800 Martin KaFai Lau wrote:
> > On Mon, Nov 16, 2020 at 10:00:04AM -0800, Jakub Kicinski wrote:
> > > Locks that can run in any context but preempt disabled or softirq
> > > disabled?  
> > Not exactly. e.g. running from irq won't work.
> > 
> > > Let me cut to the chase. Are you sure you didn't mean to check
> > > if (irq_count()) ?  
> > so, no.
> > 
> > From preempt.h:
> > 
> > /*
> >  * ...
> >  * in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
> >  * ...
> >  */
> > #define in_interrupt()  (irq_count())
> 
> Right, as I said in my correction (in_irq() || in_nmi()).
> 
> Just to spell it out AFAIU in_serving_softirq() will return true when
> softirq is active and interrupted by a hard irq or an NMI.
I see what you have been getting at now.

Good point. will post a fix.
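
For readers who want the reasoning spelled out, a hedged sketch of the two
checks (the helper function names are made up for illustration; the
predicates come from include/linux/preempt.h):

#include <linux/preempt.h>

/* Old check: allowed when serving a softirq or in task context.
 * If a hard irq or NMI interrupts a softirq handler,
 * in_serving_softirq() is still true, so irq/nmi sneaks through.
 */
static bool sk_storage_tracing_ctx_ok_old(void)
{
	return in_serving_softirq() || in_task();
}

/* New check (the fix below): reject hardirq and NMI directly. */
static bool sk_storage_tracing_ctx_ok_new(void)
{
	return !in_irq() && !in_nmi();
}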


[PATCH bpf-next] bpf: Fix the irq and nmi check in bpf_sk_storage for tracing usage

2020-11-16 Thread Martin KaFai Lau
The intention of the current check is to avoid using bpf_sk_storage
in irq and nmi.  Jakub pointed out that the current check cannot
do that.  For example, in_serving_softirq() returns true
if the softirq handling is interrupted by a hard irq.

Fixes: 8e4597c627fb ("bpf: Allow using bpf_sk_storage in FENTRY/FEXIT/RAW_TP")
Suggested-by: Jakub Kicinski 
Signed-off-by: Martin KaFai Lau 
---
 net/core/bpf_sk_storage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index 359908a7d3c1..a32037daa933 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -415,7 +415,7 @@ static bool bpf_sk_storage_tracing_allowed(const struct 
bpf_prog *prog)
 BPF_CALL_4(bpf_sk_storage_get_tracing, struct bpf_map *, map, struct sock *, 
sk,
   void *, value, u64, flags)
 {
-   if (!in_serving_softirq() && !in_task())
+   if (in_irq() || in_nmi())
return (unsigned long)NULL;
 
return (unsigned long)bpf_sk_storage_get(map, sk, value, flags);
@@ -424,7 +424,7 @@ BPF_CALL_4(bpf_sk_storage_get_tracing, struct bpf_map *, 
map, struct sock *, sk,
 BPF_CALL_2(bpf_sk_storage_delete_tracing, struct bpf_map *, map,
   struct sock *, sk)
 {
-   if (!in_serving_softirq() && !in_task())
+   if (in_irq() || in_nmi())
return -EPERM;
 
return bpf_sk_storage_delete(map, sk);
-- 
2.24.1



Re: [FIX bpf,perf] bpf,perf: return EOPNOTSUPP for bpf handler on PERF_COUNT_SW_DUMMY

2020-11-16 Thread Martin KaFai Lau
On Mon, Nov 16, 2020 at 07:37:52PM +0100, Florian Lehner wrote:
> bpf handlers for perf events other than tracepoints, kprobes or uprobes
> are attached to the overflow_handler of the perf event.
> 
> Perf events of type software/dummy are placeholder events. So when
> attaching a bpf handle to an overflow_handler of such an event, the bpf
> handler will not be triggered.
> 
> This fix returns the error EOPNOTSUPP to indicate that attaching a bpf
> handler to a perf event of type software/dummy is not supported.
> 
> Signed-off-by: Florian Lehner 
It is missing a Fixes tag.


Re: [PATCH bpf-next] bpf: Add bpf_ktime_get_coarse_ns helper

2020-11-17 Thread Martin KaFai Lau
On Tue, Nov 17, 2020 at 06:45:49PM +, Dmitrii Banshchikov wrote:
> The helper uses CLOCK_MONOTONIC_COARSE source of time that is less
> accurate but more performant.
> 
> We have a BPF CGROUP_SKB firewall that supports event logging through
> bpf_perf_event_output(). Each event has a timestamp and currently we use
> bpf_ktime_get_ns() for it. Use of bpf_ktime_get_coarse_ns() saves ~15-20
> ns in time required for event logging.
> 
> bpf_ktime_get_ns():
> EgressLogByRemoteEndpoint  113.82ns8.79M
> bpf_ktime_get_coarse_ns():
> EgressLogByRemoteEndpoint       95.40ns   10.48M
Acked-by: Martin KaFai Lau 
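
For illustration only (the map name, event layout and section name below
are not from the patch): a CGROUP_SKB egress program can stamp its log
records with the coarse clock like this:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
} events SEC(".maps");

struct egress_event {
	__u64 ts_ns;
	__u32 len;
};

SEC("cgroup_skb/egress")
int log_egress(struct __sk_buff *skb)
{
	struct egress_event ev = {
		/* CLOCK_MONOTONIC_COARSE: cheaper, lower resolution */
		.ts_ns = bpf_ktime_get_coarse_ns(),
		.len = skb->len,
	};

	bpf_perf_event_output(skb, &events, BPF_F_CURRENT_CPU,
			      &ev, sizeof(ev));
	return 1; /* allow the packet */
}

char _license[] SEC("license") = "GPL";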


Re: [PATCH bpf-next 1/9] selftests: bpf: move tracing helpers to trace_helper

2020-11-17 Thread Martin KaFai Lau
On Tue, Nov 17, 2020 at 02:56:36PM +, Daniel T. Lee wrote:
> Under the samples/bpf directory, similar tracing helpers are
> fragmented around. To keep consistent of tracing programs, this commit
> moves the helper and define locations to increase the reuse of each
> helper function.
> 
> Signed-off-by: Daniel T. Lee 
> 
> ---
>  samples/bpf/Makefile|  2 +-
>  samples/bpf/hbm.c   | 51 -
>  tools/testing/selftests/bpf/trace_helpers.c | 33 -
>  tools/testing/selftests/bpf/trace_helpers.h |  3 ++
>  4 files changed, 45 insertions(+), 44 deletions(-)
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index aeebf5d12f32..3e83cd5ca1c2 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -110,7 +110,7 @@ xdp_fwd-objs := xdp_fwd_user.o
>  task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
>  xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
>  ibumad-objs := bpf_load.o ibumad_user.o $(TRACE_HELPERS)
> -hbm-objs := bpf_load.o hbm.o $(CGROUP_HELPERS)
> +hbm-objs := bpf_load.o hbm.o $(CGROUP_HELPERS) $(TRACE_HELPERS)
>  
>  # Tell kbuild to always build the programs
>  always-y := $(tprogs-y)
> diff --git a/samples/bpf/hbm.c b/samples/bpf/hbm.c
> index 400e741a56eb..b9f9f771dd81 100644
> --- a/samples/bpf/hbm.c
> +++ b/samples/bpf/hbm.c
> @@ -48,6 +48,7 @@
>  
>  #include "bpf_load.h"
>  #include "bpf_rlimit.h"
> +#include "trace_helpers.h"
>  #include "cgroup_helpers.h"
>  #include "hbm.h"
>  #include "bpf_util.h"
> @@ -65,51 +66,12 @@ bool no_cn_flag;
>  bool edt_flag;
>  
>  static void Usage(void);
> -static void read_trace_pipe2(void);
>  static void do_error(char *msg, bool errno_flag);
>  
> -#define DEBUGFS "/sys/kernel/debug/tracing/"
> -
>  struct bpf_object *obj;
>  int bpfprog_fd;
>  int cgroup_storage_fd;
>  
> -static void read_trace_pipe2(void)
> -{
> - int trace_fd;
> - FILE *outf;
> - char *outFname = "hbm_out.log";
> -
> - trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
> - if (trace_fd < 0) {
> - printf("Error opening trace_pipe\n");
> - return;
> - }
> -
> -//   Future support of ingress
> -//   if (!outFlag)
> -//   outFname = "hbm_in.log";
> - outf = fopen(outFname, "w");
> -
> - if (outf == NULL)
> - printf("Error creating %s\n", outFname);
> -
> - while (1) {
> - static char buf[4097];
> - ssize_t sz;
> -
> - sz = read(trace_fd, buf, sizeof(buf) - 1);
> - if (sz > 0) {
> - buf[sz] = 0;
> - puts(buf);
> - if (outf != NULL) {
> - fprintf(outf, "%s\n", buf);
> - fflush(outf);
> - }
> - }
> - }
> -}
> -
>  static void do_error(char *msg, bool errno_flag)
>  {
>   if (errno_flag)
> @@ -392,8 +354,15 @@ static int run_bpf_prog(char *prog, int cg_id)
>   fclose(fout);
>   }
>  
> - if (debugFlag)
> - read_trace_pipe2();
> + if (debugFlag) {
> + char *out_fname = "hbm_out.log";
> + /* Future support of ingress */
> + // if (!outFlag)
> + //  out_fname = "hbm_in.log";
> +
> + read_trace_pipe2(out_fname);
> + }
> +
>   return rc;
>  err:
>   rc = 1;
> diff --git a/tools/testing/selftests/bpf/trace_helpers.c 
> b/tools/testing/selftests/bpf/trace_helpers.c
> index 1bbd1d9830c8..b7c184e109e8 100644
> --- a/tools/testing/selftests/bpf/trace_helpers.c
> +++ b/tools/testing/selftests/bpf/trace_helpers.c
> @@ -11,8 +11,6 @@
>  #include 
>  #include "trace_helpers.h"
>  
> -#define DEBUGFS "/sys/kernel/debug/tracing/"
Is this change needed?

> -
>  #define MAX_SYMS 30
>  static struct ksym syms[MAX_SYMS];
>  static int sym_cnt;
> @@ -136,3 +134,34 @@ void read_trace_pipe(void)
>   }
>   }
>  }
> +
> +void read_trace_pipe2(char *filename)
> +{
> + int trace_fd;
> + FILE *outf;
> +
> + trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
> + if (trace_fd < 0) {
> + printf("Error opening trace_pipe\n");
> + return;
> + }
> +
> + outf = fopen(filename, "w");
> + if (!outf)
> + printf("Error creating %s\n", filename);
> +
> + while (1) {
> + static char buf[4096];
> + ssize_t sz;
> +
> + sz = read(trace_fd, buf, sizeof(buf) - 1);
> + if (sz > 0) {
> + buf[sz] = 0;
> + puts(buf);
> + if (outf) {
> + fprintf(outf, "%s\n", buf);
> + fflush(outf);
> + }
> + }
> + }
It needs an fclose().

IIUC, this function will never return.  I am not sure
this is something that should be made available to selftests.
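
Independent of whether it belongs in selftests, a hedged sketch of a
variant that closes what it opens (the function name and the read-count
bound are inventions for illustration, not something the patch proposes):

/* Sketch only: like read_trace_pipe2(), but with cleanup and an
 * optional bound on the number of reads (max_reads < 0 = forever).
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define DEBUGFS "/sys/kernel/debug/tracing/"

void read_trace_pipe_bounded(const char *out_fname, long max_reads)
{
	char buf[4096];
	ssize_t sz;
	FILE *outf;
	int trace_fd;

	trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
	if (trace_fd < 0) {
		printf("Error opening trace_pipe\n");
		return;
	}

	outf = fopen(out_fname, "w");
	if (!outf)
		printf("Error creating %s\n", out_fname);

	while (max_reads) {
		sz = read(trace_fd, buf, sizeof(buf) - 1);
		if (sz > 0) {
			buf[sz] = 0;
			puts(buf);
			if (outf) {
				fprintf(outf, "%s\n", buf);
				fflush(outf);
			}
		}
		if (max_reads > 0)
			max_reads--;
	}

	if (outf)
		fclose(outf);
	close(trace_fd);
}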

Re: [PATCH bpf-next 2/9] samples: bpf: refactor hbm program with libbpf

2020-11-17 Thread Martin KaFai Lau
On Tue, Nov 17, 2020 at 02:56:37PM +, Daniel T. Lee wrote:
[ ... ]

> diff --git a/samples/bpf/hbm.c b/samples/bpf/hbm.c
> index b9f9f771dd81..008bc635ad9b 100644
> --- a/samples/bpf/hbm.c
> +++ b/samples/bpf/hbm.c
> @@ -46,7 +46,6 @@
>  #include 
>  #include 
>  
> -#include "bpf_load.h"
>  #include "bpf_rlimit.h"
>  #include "trace_helpers.h"
>  #include "cgroup_helpers.h"
> @@ -68,9 +67,10 @@ bool edt_flag;
>  static void Usage(void);
>  static void do_error(char *msg, bool errno_flag);
>  
> +struct bpf_program *bpf_prog;
>  struct bpf_object *obj;
> -int bpfprog_fd;
>  int cgroup_storage_fd;
> +int queue_stats_fd;
>  
>  static void do_error(char *msg, bool errno_flag)
>  {
> @@ -83,56 +83,54 @@ static void do_error(char *msg, bool errno_flag)
>  
>  static int prog_load(char *prog)
>  {
> - struct bpf_prog_load_attr prog_load_attr = {
> - .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
> - .file = prog,
> - .expected_attach_type = BPF_CGROUP_INET_EGRESS,
> - };
> - int map_fd;
> - struct bpf_map *map;
> + int rc = 1;
>  
> - int ret = 0;
> + obj = bpf_object__open_file(prog, NULL);
> + if (libbpf_get_error(obj)) {
> + printf("ERROR: opening BPF object file failed\n");
> + return rc;
> + }
>  
> - if (access(prog, O_RDONLY) < 0) {
> - printf("Error accessing file %s: %s\n", prog, strerror(errno));
> - return 1;
> + /* load BPF program */
> + if (bpf_object__load(obj)) {
> + printf("ERROR: loading BPF object file failed\n");
> + goto cleanup;
>   }
> - if (bpf_prog_load_xattr(&prog_load_attr, &obj, &bpfprog_fd))
> - ret = 1;
> - if (!ret) {
> - map = bpf_object__find_map_by_name(obj, "queue_stats");
> - map_fd = bpf_map__fd(map);
> - if (map_fd < 0) {
> - printf("Map not found: %s\n", strerror(map_fd));
> - ret = 1;
> - }
> +
> + bpf_prog = bpf_object__find_program_by_title(obj, "cgroup_skb/egress");
> + if (!bpf_prog) {
> + printf("ERROR: finding a prog in obj file failed\n");
> + goto cleanup;
>   }
>  
> - if (ret) {
> - printf("ERROR: bpf_prog_load_xattr failed for: %s\n", prog);
> - printf("  Output from verifier:\n%s\n--\n", bpf_log_buf);
> - ret = -1;
> - } else {
> - ret = map_fd;
> + queue_stats_fd = bpf_object__find_map_fd_by_name(obj, "queue_stats");
> + if (queue_stats_fd < 0) {
> + printf("ERROR: finding a map in obj file failed\n");
> + goto cleanup;
>   }
>  
> - return ret;
> + rc = 0;
Just return 0.

> +
> +cleanup:
> + if (rc != 0)
so this test can be avoided.

> + bpf_object__close(obj);
> +
> + return rc;
>  }
>  
>  static int run_bpf_prog(char *prog, int cg_id)
>  {
> - int map_fd;
> - int rc = 0;
> + struct hbm_queue_stats qstats = {0};
> + struct bpf_link *link = NULL;
> + char cg_dir[100];
>   int key = 0;
>   int cg1 = 0;
> - int type = BPF_CGROUP_INET_EGRESS;
> - char cg_dir[100];
> - struct hbm_queue_stats qstats = {0};
> + int rc = 0;
>  
>   sprintf(cg_dir, "/hbm%d", cg_id);
> - map_fd = prog_load(prog);
> - if (map_fd  == -1)
> - return 1;
> + rc = prog_load(prog);
> + if (rc != 0)
> + return rc;
>  
>   if (setup_cgroup_environment()) {
>   printf("ERROR: setting cgroup environment\n");
> @@ -152,16 +150,18 @@ static int run_bpf_prog(char *prog, int cg_id)
>   qstats.stats = stats_flag ? 1 : 0;
>   qstats.loopback = loopback_flag ? 1 : 0;
>   qstats.no_cn = no_cn_flag ? 1 : 0;
> - if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY)) {
> + if (bpf_map_update_elem(queue_stats_fd, &key, &qstats, BPF_ANY)) {
>   printf("ERROR: Could not update map element\n");
>   goto err;
>   }
>  
>   if (!outFlag)
> - type = BPF_CGROUP_INET_INGRESS;
> - if (bpf_prog_attach(bpfprog_fd, cg1, type, 0)) {
> - printf("ERROR: bpf_prog_attach fails!\n");
> - log_err("Attaching prog");
> + bpf_program__set_expected_attach_type(bpf_prog, 
> BPF_CGROUP_INET_INGRESS);
> +
> + link = bpf_program__attach_cgroup(bpf_prog, cg1);
There is a difference here.
I think the bpf_prog will be detached when the link is gone (e.g. on process exit).
I am not sure that is what hbm expects, considering the
cgroup is not cleaned up in the success case.

> + if (libbpf_get_error(link)) {
> + fprintf(stderr, "ERROR: bpf_program__attach_cgroup failed\n");
> + link = NULL;
not needed.  bpf_link__destroy() can handle err ptr.

>   goto err;
>   }
>  
> @@ -175,7 +175,7 @@ static int run_bpf_prog(char *prog, int cg_id)
>  #define DELTA_RATE_CHECK 1   /*
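
If hbm is meant to stay attached after the loader exits, one option (a
sketch only; the pin path is illustrative) is to pin the link before
returning:

	/* In run_bpf_prog(), after a successful attach: */
	link = bpf_program__attach_cgroup(bpf_prog, cg1);
	if (libbpf_get_error(link))
		goto err;

	/* Pinning keeps the attachment alive after the process exits. */
	if (bpf_link__pin(link, "/sys/fs/bpf/hbm_link")) {
		fprintf(stderr, "ERROR: bpf_link__pin failed\n");
		goto err;
	}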

Re: [PATCH bpf-next 3/9] samples: bpf: refactor test_cgrp2_sock2 program with libbpf

2020-11-17 Thread Martin KaFai Lau
On Tue, Nov 17, 2020 at 02:56:38PM +, Daniel T. Lee wrote:
[ ... ]

> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 01449d767122..7a643595ac6c 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -82,7 +82,7 @@ test_overhead-objs := bpf_load.o test_overhead_user.o
>  test_cgrp2_array_pin-objs := test_cgrp2_array_pin.o
>  test_cgrp2_attach-objs := test_cgrp2_attach.o
>  test_cgrp2_sock-objs := test_cgrp2_sock.o
> -test_cgrp2_sock2-objs := bpf_load.o test_cgrp2_sock2.o
> +test_cgrp2_sock2-objs := test_cgrp2_sock2.o
>  xdp1-objs := xdp1_user.o
>  # reuse xdp1 source intentionally
>  xdp2-objs := xdp1_user.o
> diff --git a/samples/bpf/test_cgrp2_sock2.c b/samples/bpf/test_cgrp2_sock2.c
> index a9277b118c33..518526c7ce16 100644
> --- a/samples/bpf/test_cgrp2_sock2.c
> +++ b/samples/bpf/test_cgrp2_sock2.c
> @@ -20,9 +20,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "bpf_insn.h"
> -#include "bpf_load.h"
>  
>  static int usage(const char *argv0)
>  {
> @@ -32,37 +32,66 @@ static int usage(const char *argv0)
>  
>  int main(int argc, char **argv)
>  {
> - int cg_fd, ret, filter_id = 0;
> + int cg_fd, err, ret = EXIT_FAILURE, filter_id = 0, prog_cnt = 0;
> + const char *link_pin_path = "/sys/fs/bpf/test_cgrp2_sock2";
> + struct bpf_link *link = NULL;
> + struct bpf_program *progs[2];
> + struct bpf_program *prog;
> + struct bpf_object *obj;
>  
>   if (argc < 3)
>   return usage(argv[0]);
>  
> + if (argc > 3)
> + filter_id = atoi(argv[3]);
> +
>   cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY);
>   if (cg_fd < 0) {
>   printf("Failed to open cgroup path: '%s'\n", strerror(errno));
> - return EXIT_FAILURE;
> + return ret;
>   }
>  
> - if (load_bpf_file(argv[2]))
> - return EXIT_FAILURE;
> -
> - printf("Output from kernel verifier:\n%s\n---\n", bpf_log_buf);
> + obj = bpf_object__open_file(argv[2], NULL);
> + if (libbpf_get_error(obj)) {
> + printf("ERROR: opening BPF object file failed\n");
> + return ret;
> + }
>  
> - if (argc > 3)
> - filter_id = atoi(argv[3]);
> + bpf_object__for_each_program(prog, obj) {
> + progs[prog_cnt] = prog;
> + prog_cnt++;
> + }
>  
>   if (filter_id >= prog_cnt) {
>   printf("Invalid program id; program not found in file\n");
> - return EXIT_FAILURE;
> + goto cleanup;
> + }
> +
> + /* load BPF program */
> + if (bpf_object__load(obj)) {
> + printf("ERROR: loading BPF object file failed\n");
> + goto cleanup;
>   }
>  
> - ret = bpf_prog_attach(prog_fd[filter_id], cg_fd,
> -   BPF_CGROUP_INET_SOCK_CREATE, 0);
> - if (ret < 0) {
> - printf("Failed to attach prog to cgroup: '%s'\n",
> -strerror(errno));
> - return EXIT_FAILURE;
> + link = bpf_program__attach_cgroup(progs[filter_id], cg_fd);
> + if (libbpf_get_error(link)) {
> + printf("ERROR: bpf_program__attach failed\n");
> + link = NULL;
> + goto cleanup;
>   }
>  
> - return EXIT_SUCCESS;
> + err = bpf_link__pin(link, link_pin_path);
> + if (err < 0) {
> + printf("err : %d\n", err);
> + goto cleanup;
> + }
> +
> + ret = EXIT_SUCCESS;
> +
> +cleanup:
> + if (ret != EXIT_SUCCESS)
> + bpf_link__destroy(link);
This looks wrong.  cleanup should be done regardless.

> +
> + bpf_object__close(obj);
> + return ret;
>  }
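
Concretely, the label can run unconditionally, since bpf_link__destroy()
copes with NULL and error pointers -- a sketch of the shape being asked
for:

cleanup:
	/* Safe even when link is NULL or an error pointer. */
	bpf_link__destroy(link);
	bpf_object__close(obj);
	return ret;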


Re: [PATCH bpf-next 4/9] samples: bpf: refactor task_fd_query program with libbpf

2020-11-17 Thread Martin KaFai Lau
On Tue, Nov 17, 2020 at 02:56:39PM +, Daniel T. Lee wrote:
> This commit refactors the existing kprobe program with libbpf bpf
> loader. To attach bpf program, this uses generic bpf_program__attach()
> approach rather than using bpf_load's load_bpf_file().
> 
> To attach bpf to perf_event, instead of using previous ioctl method,
> this commit uses bpf_program__attach_perf_event since it manages the
> enable of perf_event and attach of BPF programs to it, which is much
> more intuitive way to achieve.
> 
> Also, explicit close(fd) has been removed since event will be closed
> inside bpf_link__destroy() automatically.
> 
> DEBUGFS macro from trace_helpers has been used to control uprobe events.
> Furthermore, to prevent conflict of same named uprobe events, O_TRUNC
> flag has been used to clear 'uprobe_events' interface.
> 
> Signed-off-by: Daniel T. Lee 
> ---
>  samples/bpf/Makefile |   2 +-
>  samples/bpf/task_fd_query_user.c | 101 ++-
>  2 files changed, 74 insertions(+), 29 deletions(-)
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 7a643595ac6c..36b261c7afc7 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -107,7 +107,7 @@ xdp_adjust_tail-objs := xdp_adjust_tail_user.o
>  xdpsock-objs := xdpsock_user.o
>  xsk_fwd-objs := xsk_fwd.o
>  xdp_fwd-objs := xdp_fwd_user.o
> -task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
> +task_fd_query-objs := task_fd_query_user.o $(TRACE_HELPERS)
>  xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
>  ibumad-objs := bpf_load.o ibumad_user.o $(TRACE_HELPERS)
>  hbm-objs := hbm.o $(CGROUP_HELPERS) $(TRACE_HELPERS)
> diff --git a/samples/bpf/task_fd_query_user.c 
> b/samples/bpf/task_fd_query_user.c
> index b68bd2f8fdc9..0891ef3a4779 100644
> --- a/samples/bpf/task_fd_query_user.c
> +++ b/samples/bpf/task_fd_query_user.c
> @@ -15,12 +15,15 @@
>  #include 
>  #include 
>  
> +#include 
>  #include 
> -#include "bpf_load.h"
>  #include "bpf_util.h"
>  #include "perf-sys.h"
>  #include "trace_helpers.h"
>  
> +struct bpf_program *progs[2];
> +struct bpf_link *links[2];
static


Re: [PATCH bpf-next 1/2] bpf: try to avoid kzalloc in cgroup/{s,g}etsockopt

2020-12-30 Thread Martin KaFai Lau
On Mon, Dec 21, 2020 at 02:22:41PM -0800, Song Liu wrote:
> On Thu, Dec 17, 2020 at 9:24 AM Stanislav Fomichev  wrote:
> >
> > When we attach a bpf program to cgroup/getsockopt any other getsockopt()
> > syscall starts incurring kzalloc/kfree cost. While, in general, it's
> > not an issue, sometimes it is, like in the case of TCP_ZEROCOPY_RECEIVE.
> > TCP_ZEROCOPY_RECEIVE (ab)uses getsockopt system call to implement
> > fastpath for incoming TCP, we don't want to have extra allocations in
> > there.
> >
> > Let add a small buffer on the stack and use it for small (majority)
> > {s,g}etsockopt values. I've started with 128 bytes to cover
> > the options we care about (TCP_ZEROCOPY_RECEIVE which is 32 bytes
> > currently, with some planned extension to 64 + some headroom
> > for the future).
> 
> I don't really know the rule of thumb, but 128 bytes on stack feels too big to
> me. I would like to hear others' opinions on this. Can we solve the problem
> with some other mechanisms, e.g. a mempool?
It seems do_tcp_getsockopt() also has a "struct tcp_zerocopy_receive"
on the stack.  I think the buf here is also mimicking
"struct tcp_zerocopy_receive", so it should not cause any
new problem.

However, "struct tcp_zerocopy_receive" is only 40 bytes now.  I think it
is better to have a smaller buf for now and increase it later when the
planned extensions to "struct tcp_zerocopy_receive" are also upstreamed.


Re: [PATCH bpf-next 1/2] bpf: try to avoid kzalloc in cgroup/{s,g}etsockopt

2020-12-30 Thread Martin KaFai Lau
On Tue, Dec 22, 2020 at 07:09:33PM -0800, s...@google.com wrote:
> On 12/22, Martin KaFai Lau wrote:
> > On Thu, Dec 17, 2020 at 09:23:23AM -0800, Stanislav Fomichev wrote:
> > > When we attach a bpf program to cgroup/getsockopt any other getsockopt()
> > > syscall starts incurring kzalloc/kfree cost. While, in general, it's
> > > not an issue, sometimes it is, like in the case of TCP_ZEROCOPY_RECEIVE.
> > > TCP_ZEROCOPY_RECEIVE (ab)uses getsockopt system call to implement
> > > fastpath for incoming TCP, we don't want to have extra allocations in
> > > there.
> > >
> > > Let add a small buffer on the stack and use it for small (majority)
> > > {s,g}etsockopt values. I've started with 128 bytes to cover
> > > the options we care about (TCP_ZEROCOPY_RECEIVE which is 32 bytes
> > > currently, with some planned extension to 64 + some headroom
> > > for the future).
> > >
> > > It seems natural to do the same for setsockopt, but it's a bit more
> > > involved when the BPF program modifies the data (where we have to
> > > kmalloc). The assumption is that for the majority of setsockopt
> > > calls (which are doing pure BPF options or apply policy) this
> > > will bring some benefit as well.
> > >
> > > Signed-off-by: Stanislav Fomichev 
> > > ---
> > >  include/linux/filter.h |  3 +++
> > >  kernel/bpf/cgroup.c| 41 +++--
> > >  2 files changed, 42 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > index 29c27656165b..362eb0d7af5d 100644
> > > --- a/include/linux/filter.h
> > > +++ b/include/linux/filter.h
> > > @@ -1281,6 +1281,8 @@ struct bpf_sysctl_kern {
> > >   u64 tmp_reg;
> > >  };
> > >
> > > +#define BPF_SOCKOPT_KERN_BUF_SIZE 128
> > Since these 128 bytes (which then needs to be zero-ed) is modeled after
> > the TCP_ZEROCOPY_RECEIVE use case, it will be useful to explain
> > a use case on how the bpf prog will interact with
> > getsockopt(TCP_ZEROCOPY_RECEIVE).
> The only thing I would expect BPF program can do is to return EPERM
> to cause application to fallback to non-zerocopy path (and, mostly,
> bypass). I don't think BPF can meaningfully mangle the data in struct
> tcp_zerocopy_receive.
> 
> Does it address your concern? Or do you want me to add a comment or
> something?
I was asking because, while 128 bytes may work best for TCP_ZEROCOPY_RECEIVE,
it means many unnecessary byte-zeroings for most options.
Hence, I am interested to see if there is a practical bpf
use case for TCP_ZEROCOPY_RECEIVE.


Re: [PATCH bpf-next 1/2] bpf: try to avoid kzalloc in cgroup/{s,g}etsockopt

2021-01-04 Thread Martin KaFai Lau
On Thu, Dec 31, 2020 at 12:14:13PM -0800, s...@google.com wrote:
> On 12/30, Martin KaFai Lau wrote:
> > On Mon, Dec 21, 2020 at 02:22:41PM -0800, Song Liu wrote:
> > > On Thu, Dec 17, 2020 at 9:24 AM Stanislav Fomichev 
> > wrote:
> > > >
> > > > When we attach a bpf program to cgroup/getsockopt any other
> > getsockopt()
> > > > syscall starts incurring kzalloc/kfree cost. While, in general, it's
> > > > not an issue, sometimes it is, like in the case of
> > TCP_ZEROCOPY_RECEIVE.
> > > > TCP_ZEROCOPY_RECEIVE (ab)uses getsockopt system call to implement
> > > > fastpath for incoming TCP, we don't want to have extra allocations in
> > > > there.
> > > >
> > > > Let add a small buffer on the stack and use it for small (majority)
> > > > {s,g}etsockopt values. I've started with 128 bytes to cover
> > > > the options we care about (TCP_ZEROCOPY_RECEIVE which is 32 bytes
> > > > currently, with some planned extension to 64 + some headroom
> > > > for the future).
> > >
> > > I don't really know the rule of thumb, but 128 bytes on stack feels
> > too big to
> > > me. I would like to hear others' opinions on this. Can we solve the
> > problem
> > > with some other mechanisms, e.g. a mempool?
> > It seems do_tcp_getsockopt() also has a "struct
> > tcp_zerocopy_receive"
> > on the stack.  I think the buf here is also mimicking
> > "struct tcp_zerocopy_receive", so it should not cause any
> > new problem.
> Good point!
> 
> > However, "struct tcp_zerocopy_receive" is only 40 bytes now.  I think it
> > is better to have a smaller buf for now and increase it later when the
> > planned extensions to "struct tcp_zerocopy_receive" are also upstreamed.
> I can lower it to 64. Or even 40?
I think either is fine.  Both will need another cacheline on bpf_sockopt_kern.
128 is a bit too much without a clear understanding of what "some headroom
for the future" means.


Re: [PATCH bpf-next v2 1/2] bpf: try to avoid kzalloc in cgroup/{s,g}etsockopt

2021-01-04 Thread Martin KaFai Lau
On Mon, Jan 04, 2021 at 02:14:53PM -0800, Stanislav Fomichev wrote:
> When we attach a bpf program to cgroup/getsockopt any other getsockopt()
> syscall starts incurring kzalloc/kfree cost. While, in general, it's
> not an issue, sometimes it is, like in the case of TCP_ZEROCOPY_RECEIVE.
> TCP_ZEROCOPY_RECEIVE (ab)uses getsockopt system call to implement
> fastpath for incoming TCP, we don't want to have extra allocations in
> there.
> 
> Let add a small buffer on the stack and use it for small (majority)
> {s,g}etsockopt values. I've started with 128 bytes to cover
> the options we care about (TCP_ZEROCOPY_RECEIVE which is 32 bytes
> currently, with some planned extension to 64).
> 
> It seems natural to do the same for setsockopt, but it's a bit more
> involved when the BPF program modifies the data (where we have to
> kmalloc). The assumption is that for the majority of setsockopt
> calls (which are doing pure BPF options or apply policy) this
> will bring some benefit as well.
> 
> Collected some performance numbers using (on a 65k MTU localhost in a VM):
> $ perf record -g -- ./tcp_mmap -s -z
> $ ./tcp_mmap -H ::1 -z
> $ ...
> $ perf report --symbol-filter=__cgroup_bpf_run_filter_getsockopt
> 
> Without this patch:
>  4.81% 0.07%  tcp_mmap  [kernel.kallsyms]  [k] 
> __cgroup_bpf_run_filter_getsockopt
> |
>  --4.74%--__cgroup_bpf_run_filter_getsockopt
>|
>|--1.06%--__kmalloc
>|
>|--0.71%--lock_sock_nested
>|
>|--0.62%--__might_fault
>|
> --0.52%--release_sock
> 
> With the patch applied:
>  3.29% 0.07%  tcp_mmap  [kernel.kallsyms]  [k] 
> __cgroup_bpf_run_filter_getsockopt
> |
>  --3.22%--__cgroup_bpf_run_filter_getsockopt
>|
>|--0.66%--lock_sock_nested
>|
>|--0.57%--__might_fault
>|
> --0.56%--release_sock
> 
> So it saves about 1% of the system call. Unfortunately, we still get
> 2-3% of overhead due to another socket lock/unlock :-(
That could be a future exercise to optimize the fast path sockopts. ;)

[ ... ]

> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include  /* sizeof(struct tcp_zerocopy_receive) */
To be more specific, it should be .

>  
>  #include "../cgroup/cgroup-internal.h"
>  
> @@ -1298,6 +1299,7 @@ static bool __cgroup_bpf_prog_array_is_empty(struct 
> cgroup *cgrp,
>   return empty;
>  }
>  
> +
Extra newline.

>  static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
>  {
>   if (unlikely(max_optlen < 0))
> @@ -1310,6 +1312,18 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern 
> *ctx, int max_optlen)
>   max_optlen = PAGE_SIZE;
>   }
>  
> + if (max_optlen <= sizeof(ctx->buf)) {
> + /* When the optval fits into BPF_SOCKOPT_KERN_BUF_SIZE
> +  * bytes avoid the cost of kzalloc.
> +  */
If it needs to respin, it will be good to have a few words here on why
it only BUILD_BUG checks for "struct tcp_zerocopy_receive".

> + BUILD_BUG_ON(sizeof(struct tcp_zerocopy_receive) >
> +  BPF_SOCKOPT_KERN_BUF_SIZE);
> +
> + ctx->optval = ctx->buf;
> + ctx->optval_end = ctx->optval + max_optlen;
> + return max_optlen;
> + }
> +


Re: [PATCH v3 0/4] btf: support ints larger than 128 bits

2021-01-05 Thread Martin KaFai Lau
On Tue, Jan 05, 2021 at 02:45:30PM +, Sean Young wrote:
> clang supports arbitrary length ints using the _ExtInt extension. This
> can be useful to hold very large values, e.g. 256 bit or 512 bit types.
> 
> Larger types (e.g. 1024 bits) are possible but I am unaware of a use
> case for these.
> 
> This requires the _ExtInt extension enabled in clang, which is under
> review.
1. Please explain the use case.
2. All patches have the same commit message, which is not useful.
   Please spend some time on the commit messages to explain what each
   individual patch does.
3. The test_extint.py is mostly a copy-and-paste from the existing
   test_offload.py?  Does it need most of test_offload.py
   to test the BTF 256/512-bit int?  Please create a minimal
   test and use the test_progs.c infrastructure.


Re: [PATCH] selftests/bpf: remove duplicate include in test_lsm

2021-01-05 Thread Martin KaFai Lau
On Tue, Jan 05, 2021 at 07:20:47AM -0800, menglong8.d...@gmail.com wrote:
> From: Menglong Dong 
> 
> 'unistd.h' included in 'selftests/bpf/prog_tests/test_lsm.c' is
> duplicated.
It is for bpf-next.  Please put a proper tag next time.

Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next v3 3/3] bpf: remove extra lock_sock for TCP_ZEROCOPY_RECEIVE

2021-01-06 Thread Martin KaFai Lau
On Tue, Jan 05, 2021 at 01:43:50PM -0800, Stanislav Fomichev wrote:
> Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
> We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
> call in do_tcp_getsockopt using the on-stack data. This removes
> 3% overhead for locking/unlocking the socket.
> 
> Also:
> - Removed BUILD_BUG_ON (zerocopy doesn't depend on the buf size anymore)
> - Separated on-stack buffer into bpf_sockopt_buf and downsized to 32 bytes
>   (let's keep it to help with the other options)
> 
> (I can probably split this patch into two: add new features and rework
>  bpf_sockopt_buf; can follow up if the approach in general sounds
>  good).
> 
> Without this patch:
>  3.29% 0.07%  tcp_mmap  [kernel.kallsyms]  [k] 
> __cgroup_bpf_run_filter_getsockopt
> |
>  --3.22%--__cgroup_bpf_run_filter_getsockopt
>|
>|--0.66%--lock_sock_nested
A general question for sockopt prog, why the BPF_CGROUP_(GET|SET)SOCKOPT prog
has to run under lock_sock()?

>|
>|--0.57%--__might_fault
>|
> --0.56%--release_sock
> 
> With the patch applied:
>  0.42% 0.10%  tcp_mmap  [kernel.kallsyms]  [k] 
> __cgroup_bpf_run_filter_getsockopt_kern
>  0.02% 0.02%  tcp_mmap  [kernel.kallsyms]  [k] 
> __cgroup_bpf_run_filter_getsockopt
> 
[ ... ]

> @@ -1445,15 +1442,29 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock 
> *sk, int level,
>  int __user *optlen, int max_optlen,
>  int retval)
>  {
> - struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> - struct bpf_sockopt_kern ctx = {
> - .sk = sk,
> - .level = level,
> - .optname = optname,
> - .retval = retval,
> - };
> + struct bpf_sockopt_kern ctx;
> + struct bpf_sockopt_buf buf;
> + struct cgroup *cgrp;
>   int ret;
>  
> +#ifdef CONFIG_INET
> + /* TCP do_tcp_getsockopt has optimized getsockopt implementation
> +  * to avoid extra socket lock for TCP_ZEROCOPY_RECEIVE.
> +  */
> + if (sk->sk_prot->getsockopt == tcp_getsockopt &&
> + level == SOL_TCP && optname == TCP_ZEROCOPY_RECEIVE)
> + return retval;
> +#endif
That seems like too much protocol detail and is not very scalable.
It is not very related to kernel/bpf/cgroup.c, which has very little idea
whether a specific protocol has optimized things in some way (e.g. by
directly calling the cgroup's bpf prog at some strategic places, as in
this patch).  Let's see if it can be done better.

At least, these protocol checks belong to the net's socket.c
more than to the bpf's cgroup.c here.  If it also looks like layering
breakage in socket.c, maybe add a signal in sk_prot (for example)
to tell whether sk_prot->getsockopt has already called the cgroup's bpf
prog?  (e.g. tcp_getsockopt() could directly call the cgroup's bpf for all
optnames instead of only TCP_ZEROCOPY_RECEIVE).

For example:

int __sys_getsockopt(...)
{
/* ... */

if (!sk_prot->bpf_getsockopt_handled)
BPF_CGROUP_RUN_PROG_GETSOCKOPT(...);
}

Thoughts?
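
Sketched a bit further (the member and function names below follow the
direction the series takes in v6 later in this thread; treat them as
provisional here):

/* In struct proto: let a protocol claim an optname. */
bool	(*bpf_bypass_getsockopt)(int level, int optname);

/* TCP claims TCP_ZEROCOPY_RECEIVE because do_tcp_getsockopt()
 * runs the cgroup bpf prog itself on that path.
 */
bool tcp_bpf_bypass_getsockopt(int level, int optname)
{
	return level == SOL_TCP && optname == TCP_ZEROCOPY_RECEIVE;
}

/* In __sys_getsockopt(): skip the generic hook when the protocol
 * handles it on its own.
 */
if (!sock->sk->sk_prot->bpf_bypass_getsockopt ||
    !sock->sk->sk_prot->bpf_bypass_getsockopt(level, optname))
	err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname,
					     optval, optlen, max_optlen,
					     err);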

> +
> + memset(&buf, 0, sizeof(buf));
> + memset(&ctx, 0, sizeof(ctx));
> +
> + cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> + ctx.sk = sk;
> + ctx.level = level;
> + ctx.optname = optname;
> + ctx.retval = retval;
> +
>   /* Opportunistic check to see whether we have any BPF program
>* attached to the hook so we don't waste time allocating
>* memory and locking the socket.


Re: [PATCH bpf] bpftool: fix compilation failure for net.o with older glibc

2021-01-06 Thread Martin KaFai Lau
On Wed, Jan 06, 2021 at 03:59:06PM +, Alan Maguire wrote:
> For older glibc ~2.17, #include'ing both linux/if.h and net/if.h
> fails due to complaints about redefinition of interface flags:
> 
>   CC   net.o
> In file included from net.c:13:0:
> /usr/include/linux/if.h:71:2: error: redeclaration of enumerator ‘IFF_UP’
>   IFF_UP= 1<<0,  /* sysfs */
>   ^
> /usr/include/net/if.h:44:5: note: previous definition of ‘IFF_UP’ was here
>  IFF_UP = 0x1,  /* Interface is up.  */
> 
> The issue was fixed in kernel headers in [1], but since compilation
> of net.c picks up system headers the problem can recur.
> 
> Dropping #include  resolves the issue and it is
> not needed for compilation anyhow.
Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next v3 3/3] bpf: remove extra lock_sock for TCP_ZEROCOPY_RECEIVE

2021-01-06 Thread Martin KaFai Lau
On Wed, Jan 06, 2021 at 02:45:56PM -0800, s...@google.com wrote:
> On 01/06, Martin KaFai Lau wrote:
> > On Tue, Jan 05, 2021 at 01:43:50PM -0800, Stanislav Fomichev wrote:
> > > Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
> > > We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
> > > call in do_tcp_getsockopt using the on-stack data. This removes
> > > 3% overhead for locking/unlocking the socket.
> > >
> > > Also:
> > > - Removed BUILD_BUG_ON (zerocopy doesn't depend on the buf size anymore)
> > > - Separated on-stack buffer into bpf_sockopt_buf and downsized to 32
> > bytes
> > >   (let's keep it to help with the other options)
> > >
> > > (I can probably split this patch into two: add new features and rework
> > >  bpf_sockopt_buf; can follow up if the approach in general sounds
> > >  good).
> > >
> > > Without this patch:
> > >  3.29% 0.07%  tcp_mmap  [kernel.kallsyms]  [k]
> > __cgroup_bpf_run_filter_getsockopt
> > > |
> > >  --3.22%--__cgroup_bpf_run_filter_getsockopt
> > >|
> > >|--0.66%--lock_sock_nested
> > A general question for sockopt prog, why the BPF_CGROUP_(GET|SET)SOCKOPT
> > prog
> > has to run under lock_sock()?
> I don't think there is a strong reason. We expose sk to the BPF program,
> but mainly for the socket storage map (which, afaik, doesn't require
> socket to be locked). OTOH, it seems that providing a consistent view
> of the sk to the BPF is a good idea.
hmm... most bpf progs also do not require a locked sock.  For
example, __sk_buff->sk.  If a bpf prog needs a locked view of the sk,
a more generic solution is desired.  Anyhow, I guess the train has sort
of sailed for sockopt bpf.

> 
> Eric has suggested to try to use fast socket lock. It helps a bit,
> but it doesn't remove the issue completely because
> we do a bunch of copy_{to,from}_user in the generic
> __cgroup_bpf_run_filter_getsockopt as well :-(
> 
> > >|
> > >|--0.57%--__might_fault
Is it a debug kernel?

> > >|
> > > --0.56%--release_sock
> > >
> > > With the patch applied:
> > >  0.42% 0.10%  tcp_mmap  [kernel.kallsyms]  [k]
> > __cgroup_bpf_run_filter_getsockopt_kern
> > >  0.02% 0.02%  tcp_mmap  [kernel.kallsyms]  [k]
> > __cgroup_bpf_run_filter_getsockopt
> > >
> > [ ... ]
> 
> > > @@ -1445,15 +1442,29 @@ int __cgroup_bpf_run_filter_getsockopt(struct
> > sock *sk, int level,
> > >  int __user *optlen, int max_optlen,
> > >  int retval)
> > >  {
> > > - struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > > - struct bpf_sockopt_kern ctx = {
> > > - .sk = sk,
> > > - .level = level,
> > > - .optname = optname,
> > > - .retval = retval,
> > > - };
> > > + struct bpf_sockopt_kern ctx;
> > > + struct bpf_sockopt_buf buf;
> > > + struct cgroup *cgrp;
> > >   int ret;
> > >
> > > +#ifdef CONFIG_INET
> > > + /* TCP do_tcp_getsockopt has optimized getsockopt implementation
> > > +  * to avoid extra socket lock for TCP_ZEROCOPY_RECEIVE.
> > > +  */
> > > + if (sk->sk_prot->getsockopt == tcp_getsockopt &&
> > > + level == SOL_TCP && optname == TCP_ZEROCOPY_RECEIVE)
> > > + return retval;
> > > +#endif
> > That seems like too much protocol detail and is not very scalable.
> > It is not very related to kernel/bpf/cgroup.c, which has very little idea
> > whether a specific protocol has optimized things in some way (e.g. by
> > directly calling the cgroup's bpf prog at some strategic places, as in
> > this patch).  Let's see if it can be done better.
> 
> > At least, these protocol checks belong to the net's socket.c
> > more than to the bpf's cgroup.c here.  If it also looks like layering
> > breakage in socket.c, maybe add a signal in sk_prot (for example)
> > to tell whether sk_prot->getsockopt has already called the cgroup's bpf
> > prog?  (e.g. tcp_getsockopt() could directly call the cgroup's bpf for all
> > optnames instead of only TCP_ZEROCOPY_RECEIVE).
> 
> > For example:
> 
> > int __sys_getsockopt(...)
> > {
> > /* ... */
>

Re: [PATCH bpf-next v4 3/3] bpf: remove extra lock_sock for TCP_ZEROCOPY_RECEIVE

2021-01-07 Thread Martin KaFai Lau
On Thu, Jan 07, 2021 at 10:43:05AM -0800, Stanislav Fomichev wrote:
> Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
> We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
> call in do_tcp_getsockopt using the on-stack data. This removes
> 2% overhead for locking/unlocking the socket.
> 
> Also:
> - Removed BUILD_BUG_ON (zerocopy doesn't depend on the buf size anymore)
> - Separated on-stack buffer into bpf_sockopt_buf and downsized to 32 bytes
>   (let's keep it to help with the other options)
> 
> (I can probably split this patch into two: add new features and rework
>  bpf_sockopt_buf; can follow up if the approach in general sounds
>  good).
> 
> Without this patch:
>  1.87% 0.06%  tcp_mmap  [kernel.kallsyms]  [k] 
> __cgroup_bpf_run_filter_getsockopt
> 
> With the patch applied:
>  0.52% 0.12%  tcp_mmap  [kernel.kallsyms]  [k] 
> __cgroup_bpf_run_filter_getsockopt_kern
> 
> Signed-off-by: Stanislav Fomichev 
> Cc: Martin KaFai Lau 
> Cc: Song Liu 
> Cc: Eric Dumazet 
> ---
>  include/linux/bpf-cgroup.h| 25 -
>  include/linux/filter.h|  6 +-
>  include/net/sock.h|  2 +
>  include/net/tcp.h |  1 +
>  kernel/bpf/cgroup.c   | 93 +--
>  net/ipv4/tcp.c| 14 +++
>  net/ipv4/tcp_ipv4.c   |  1 +
>  net/ipv6/tcp_ipv6.c   |  1 +
>  .../selftests/bpf/prog_tests/sockopt_sk.c | 22 +
>  .../testing/selftests/bpf/progs/sockopt_sk.c  | 15 +++
>  10 files changed, 147 insertions(+), 33 deletions(-)
>

[ ... ]

> @@ -454,6 +469,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct 
> bpf_map *map,
>  #define BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, \
>  optlen, max_optlen, retval) ({ retval; })
> +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sock, level, optname, optval, \
> + optlen, retval) ({ retval; })
>  #define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen, 
> \
>  kernel_optval) ({ 0; })
>  
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 54a4225f36d8..8739f1d4cac4 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1281,7 +1281,10 @@ struct bpf_sysctl_kern {
>   u64 tmp_reg;
>  };
>  
> -#define BPF_SOCKOPT_KERN_BUF_SIZE 64
> +#define BPF_SOCKOPT_KERN_BUF_SIZE 32
It is reduced from patch 1 because there is no
need to use the buf (and copy from/to the buf) in TCP_ZEROCOPY_RECEIVE?

Patch 1 is still desired (and kept in this set) because it may still
benefit other optnames?

> +struct bpf_sockopt_buf {
> + u8  data[BPF_SOCKOPT_KERN_BUF_SIZE];
> +};
>  
>  struct bpf_sockopt_kern {
>   struct sock *sk;
> @@ -1291,7 +1294,6 @@ struct bpf_sockopt_kern {
>   s32 optname;
>   s32 optlen;
>   s32 retval;
> - u8  buf[BPF_SOCKOPT_KERN_BUF_SIZE];
It is better to pick one way to do things to avoid code
churn like this within the same series.

>  };
>  
>  int copy_bpf_fprog_from_user(struct sock_fprog *dst, sockptr_t src, int len);
> diff --git a/include/net/sock.h b/include/net/sock.h
> index bdc4323ce53c..ebf44d724845 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1174,6 +1174,8 @@ struct proto {
>  
>   int (*backlog_rcv) (struct sock *sk,
>   struct sk_buff *skb);
> + bool(*bpf_bypass_getsockopt)(int level,
> +  int optname);
>  
>   void(*release_cb)(struct sock *sk);
>  
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 78d13c88720f..4bb42fb19711 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -403,6 +403,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock,
> struct poll_table_struct *wait);
>  int tcp_getsockopt(struct sock *sk, int level, int optname,
>  char __user *optval, int __user *optlen);
> +bool tcp_bpf_bypass_getsockopt(int level, int optname);
>  int tcp_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
>  unsigned int optlen);
>  void tcp_set_keepalive(struct sock *sk, int val);
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index adbecdcaa370..e82df63aedc7 100644

Re: [PATCH bpf-next v6 1/3] bpf: remove extra lock_sock for TCP_ZEROCOPY_RECEIVE

2021-01-08 Thread Martin KaFai Lau
On Fri, Jan 08, 2021 at 01:02:21PM -0800, Stanislav Fomichev wrote:
> Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
> We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
> call in do_tcp_getsockopt using the on-stack data. This removes
> 3% overhead for locking/unlocking the socket.
> 
> Without this patch:
>  3.38% 0.07%  tcp_mmap  [kernel.kallsyms]  [k] 
> __cgroup_bpf_run_filter_getsockopt
> |
>  --3.30%--__cgroup_bpf_run_filter_getsockopt
>|
> --0.81%--__kmalloc
> 
> With the patch applied:
>  0.52% 0.12%  tcp_mmap  [kernel.kallsyms]  [k] 
> __cgroup_bpf_run_filter_getsockopt_kern
> 
> Signed-off-by: Stanislav Fomichev 
> Cc: Martin KaFai Lau 
> Cc: Song Liu 
> Cc: Eric Dumazet 
> ---
>  include/linux/bpf-cgroup.h| 27 +++--
>  include/linux/indirect_call_wrapper.h |  6 +++
>  include/net/sock.h|  2 +
>  include/net/tcp.h |  1 +
>  kernel/bpf/cgroup.c   | 38 +++
>  net/ipv4/tcp.c| 14 +++
>  net/ipv4/tcp_ipv4.c   |  1 +
>  net/ipv6/tcp_ipv6.c   |  1 +
>  net/socket.c  |  3 ++
>  .../selftests/bpf/prog_tests/sockopt_sk.c | 22 +++
>  .../testing/selftests/bpf/progs/sockopt_sk.c  | 15 
>  11 files changed, 126 insertions(+), 4 deletions(-)
> 
[ ... ]

> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 6ec088a96302..c41bb2f34013 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -1485,6 +1485,44 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock 
> *sk, int level,
>   sockopt_free_buf(&ctx);
>   return ret;
>  }
> +
> +int __cgroup_bpf_run_filter_getsockopt_kern(struct sock *sk, int level,
> + int optname, void *optval,
> + int *optlen, int retval)
> +{
> + struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> + struct bpf_sockopt_kern ctx = {
> + .sk = sk,
> + .level = level,
> + .optname = optname,
> + .retval = retval,
> + .optlen = *optlen,
> + .optval = optval,
> + .optval_end = optval + *optlen,
> + };
> + int ret;
> +
The current behavior only passes the kernel optval to the bpf prog when
retval == 0.  Can you explain in a few words here
the difference and why it is fine?
Just in case some other options want to reuse
__cgroup_bpf_run_filter_getsockopt_kern() in the future.

> + ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
> +  &ctx, BPF_PROG_RUN);
> + if (!ret)
> + return -EPERM;
> +
> + if (ctx.optlen > *optlen)
> + return -EFAULT;
> +
> + /* BPF programs only allowed to set retval to 0, not some
> +  * arbitrary value.
> +  */
> + if (ctx.retval != 0 && ctx.retval != retval)
> + return -EFAULT;
> +
> + /* BPF programs can shrink the buffer, export the modifications.
> +  */
> + if (ctx.optlen != 0)
> + *optlen = ctx.optlen;
> +
> + return ctx.retval;
> +}
>  #endif
>  
>  static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,

[ ... ]

> diff --git a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c 
> b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> index b25c9c45c148..6bb18b1d8578 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> @@ -11,6 +11,7 @@ static int getsetsockopt(void)
>   char u8[4];
>   __u32 u32;
>   char cc[16]; /* TCP_CA_NAME_MAX */
> + struct tcp_zerocopy_receive zc;
I suspect it won't compile, at least in my setup.

However, tools/testing/selftests/net/tcp_mmap.c does compile fine for me.
I _guess_ that is because the net selftests include kernel/usr/include.

AFAIK, the bpf selftests use tools/include/uapi/.

Others LGTM.


Re: [PATCH bpf-next v6 2/3] bpf: try to avoid kzalloc in cgroup/{s,g}etsockopt

2021-01-08 Thread Martin KaFai Lau
On Fri, Jan 08, 2021 at 01:02:22PM -0800, Stanislav Fomichev wrote:
> When we attach a bpf program to cgroup/getsockopt any other getsockopt()
> syscall starts incurring kzalloc/kfree cost.
> 
> Let's add a small buffer on the stack and use it for small (majority)
> {s,g}etsockopt values. The buffer is small enough to fit into
> the cache line and cover the majority of simple options (most
> of them are 4 byte ints).
> 
> It seems natural to do the same for setsockopt, but it's a bit more
> involved when the BPF program modifies the data (where we have to
> kmalloc). The assumption is that for the majority of setsockopt
> calls (which are doing pure BPF options or apply policy) this
> will bring some benefit as well.
> 
> Without this patch (we remove about 1% __kmalloc):
>  3.38% 0.07%  tcp_mmap  [kernel.kallsyms]  [k] 
> __cgroup_bpf_run_filter_getsockopt
> |
>  --3.30%--__cgroup_bpf_run_filter_getsockopt
>|
>     --0.81%--__kmalloc
> 
> Signed-off-by: Stanislav Fomichev 
> Cc: Martin KaFai Lau 
> Cc: Song Liu 
> ---
>  include/linux/filter.h |  5 
>  kernel/bpf/cgroup.c| 52 --
>  2 files changed, 50 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 29c27656165b..8739f1d4cac4 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1281,6 +1281,11 @@ struct bpf_sysctl_kern {
>   u64 tmp_reg;
>  };
>  
> +#define BPF_SOCKOPT_KERN_BUF_SIZE32
> +struct bpf_sockopt_buf {
> + u8  data[BPF_SOCKOPT_KERN_BUF_SIZE];
> +};
> +
>  struct bpf_sockopt_kern {
>   struct sock *sk;
>   u8  *optval;
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index c41bb2f34013..a9aad9c419e1 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -1298,7 +1298,8 @@ static bool __cgroup_bpf_prog_array_is_empty(struct 
> cgroup *cgrp,
>   return empty;
>  }
>  
> -static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> +static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen,
> +  struct bpf_sockopt_buf *buf)
>  {
>   if (unlikely(max_optlen < 0))
>   return -EINVAL;
> @@ -1310,6 +1311,15 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern 
> *ctx, int max_optlen)
>   max_optlen = PAGE_SIZE;
>   }
>  
> + if (max_optlen <= sizeof(buf->data)) {
> + /* When the optval fits into BPF_SOCKOPT_KERN_BUF_SIZE
> +  * bytes avoid the cost of kzalloc.
> +  */
> + ctx->optval = buf->data;
> + ctx->optval_end = ctx->optval + max_optlen;
> + return max_optlen;
> + }
> +
>   ctx->optval = kzalloc(max_optlen, GFP_USER);
>   if (!ctx->optval)
>   return -ENOMEM;
> @@ -1319,16 +1329,26 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern 
> *ctx, int max_optlen)
>   return max_optlen;
>  }
>  
> -static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
> +static void sockopt_free_buf(struct bpf_sockopt_kern *ctx,
> +  struct bpf_sockopt_buf *buf)
>  {
> + if (ctx->optval == buf->data)
> + return;
>   kfree(ctx->optval);
>  }
>  
> +static bool sockopt_buf_allocated(struct bpf_sockopt_kern *ctx,
> +   struct bpf_sockopt_buf *buf)
> +{
> + return ctx->optval != buf->data;
> +}
> +
>  int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
>  int *optname, char __user *optval,
>  int *optlen, char **kernel_optval)
>  {
>   struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> + struct bpf_sockopt_buf buf = {};
>   struct bpf_sockopt_kern ctx = {
>   .sk = sk,
>   .level = *level,
> @@ -1350,7 +1370,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, 
> int *level,
>*/
>   max_optlen = max_t(int, 16, *optlen);
>  
> - max_optlen = sockopt_alloc_buf(&ctx, max_optlen);
> + max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
>   if (max_optlen < 0)
>   return max_optlen;
>  
> @@ -1390,13 +1410,30 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock 
> *sk, int *level,
>*/
>   if (ctx.optlen != 0) {
When ctx.optlen == 0, is sockopt_free_buf() called?
Did I miss something?
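
To make the concern concrete, here is a stand-alone userspace analogue of
the ownership pattern being asked about (an illustration only, not the
kernel code): the buffer is handed to the caller only when optlen != 0 and
is freed only on error, so an optlen == 0 success path frees nothing.

#include <stdlib.h>

/* Userspace analogue: the callee allocates a buffer, hands it off to the
 * caller only when optlen != 0, and frees it only on error -- so the
 * optlen == 0 success path leaks it.
 */
static int run_filter(int optlen, char **kernel_optval)
{
	char *buf = calloc(1, 64);	/* stands in for sockopt_alloc_buf() */
	int ret = 0;			/* "bpf prog ran fine" */

	if (!buf)
		return -1;

	if (optlen != 0)
		*kernel_optval = buf;	/* ownership moves to the caller */

	if (ret)
		free(buf);		/* stands in for sockopt_free_buf() */

	return ret;			/* optlen == 0 && ret == 0: buf leaks */
}

int main(void)
{
	char *kernel_optval = NULL;

	return run_filter(0, &kernel_optval);	/* leaks the 64-byte buffer */
}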

>  

Re: [PATCH bpf-next 1/4] bpf: enable task local storage for tracing programs

2021-01-11 Thread Martin KaFai Lau
On Fri, Jan 08, 2021 at 03:19:47PM -0800, Song Liu wrote:

[ ... ]

> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index dd5aedee99e73..9bd47ad2b26f1 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -140,17 +140,18 @@ static void __bpf_selem_unlink_storage(struct 
> bpf_local_storage_elem *selem)
>  {
>   struct bpf_local_storage *local_storage;
>   bool free_local_storage = false;
> + unsigned long flags;
>  
>   if (unlikely(!selem_linked_to_storage(selem)))
>   /* selem has already been unlinked from sk */
>   return;
>  
>   local_storage = rcu_dereference(selem->local_storage);
> - raw_spin_lock_bh(&local_storage->lock);
> + raw_spin_lock_irqsave(&local_storage->lock, flags);
It would be useful to have a few words in the commit message on this change
for future reference.

Please also remove the in_irq() check from bpf_sk_storage.c
to avoid confusion in the future.  It probably should
be in a separate patch.
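
(For reference, the locking difference in a nutshell -- sketched here for
illustration, not taken from the patch: _bh only disables softirqs, while
_irqsave also disables local hard interrupts and restores the saved flags
on unlock, so it stays safe if the program can fire in any context.)

	unsigned long flags;

	/* raw_spin_lock_bh() only disables softirqs; if a tracing prog can
	 * run in hard-IRQ context while the lock is held, it could deadlock
	 * on the same lock.  The irqsave variant avoids that.
	 */
	raw_spin_lock_irqsave(&local_storage->lock, flags);
	/* ... manipulate the storage lists ... */
	raw_spin_unlock_irqrestore(&local_storage->lock, flags);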

[ ... ]

> diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
> index 4ef1959a78f27..f654b56907b69 100644
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 7425b3224891d..3d65c8ebfd594 100644
[ ... ]

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -96,6 +96,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -734,6 +735,7 @@ void __put_task_struct(struct task_struct *tsk)
>   cgroup_free(tsk);
>   task_numa_free(tsk, true);
>   security_task_free(tsk);
> + bpf_task_storage_free(tsk);
>   exit_creds(tsk);
If exit_creds() is traced by a bpf prog and this prog is doing
bpf_task_storage_get(..., BPF_LOCAL_STORAGE_GET_F_CREATE),
new task storage will be created after bpf_task_storage_free().

I recall there was an earlier discussion with KP, and KP mentioned that
BPF_LSM will not be called with a task that is going away.
It seems enabling bpf task storage in bpf tracing will break
this assumption, and that needs to be addressed?


Re: [PATCH bpf-next 1/4] bpf: enable task local storage for tracing programs

2021-01-11 Thread Martin KaFai Lau
On Mon, Jan 11, 2021 at 10:35:43PM +0100, KP Singh wrote:
> On Mon, Jan 11, 2021 at 7:57 PM Martin KaFai Lau  wrote:
> >
> > On Fri, Jan 08, 2021 at 03:19:47PM -0800, Song Liu wrote:
> >
> > [ ... ]
> >
> > > diff --git a/kernel/bpf/bpf_local_storage.c 
> > > b/kernel/bpf/bpf_local_storage.c
> > > index dd5aedee99e73..9bd47ad2b26f1 100644
> > > --- a/kernel/bpf/bpf_local_storage.c
> > > +++ b/kernel/bpf/bpf_local_storage.c
> > > @@ -140,17 +140,18 @@ static void __bpf_selem_unlink_storage(struct 
> > > bpf_local_storage_elem *selem)
> > >  {
> > >   struct bpf_local_storage *local_storage;
> > >   bool free_local_storage = false;
> > > + unsigned long flags;
> > >
> > >   if (unlikely(!selem_linked_to_storage(selem)))
> > >   /* selem has already been unlinked from sk */
> > >   return;
> > >
> > >   local_storage = rcu_dereference(selem->local_storage);
> > > - raw_spin_lock_bh(&local_storage->lock);
> > > + raw_spin_lock_irqsave(&local_storage->lock, flags);
> > It would be useful to have a few words in the commit message on this change
> > for future reference.
> >
> > Please also remove the in_irq() check from bpf_sk_storage.c
> > to avoid confusion in the future.  It probably should
> > be in a separate patch.
> >
> > [ ... ]
> >
> > > diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
> > > index 4ef1959a78f27..f654b56907b69 100644
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 7425b3224891d..3d65c8ebfd594 100644
> > [ ... ]
> >
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -96,6 +96,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >
> > >  #include 
> > >  #include 
> > > @@ -734,6 +735,7 @@ void __put_task_struct(struct task_struct *tsk)
> > >   cgroup_free(tsk);
> > >   task_numa_free(tsk, true);
> > >   security_task_free(tsk);
> > > + bpf_task_storage_free(tsk);
> > >   exit_creds(tsk);
> > If exit_creds() is traced by a bpf prog and this prog is doing
> > bpf_task_storage_get(..., BPF_LOCAL_STORAGE_GET_F_CREATE),
> > new task storage will be created after bpf_task_storage_free().
> >
> > I recall there was an earlier discussion with KP, and KP mentioned that
> > BPF_LSM will not be called with a task that is going away.
> > It seems enabling bpf task storage in bpf tracing will break
> > this assumption, and that needs to be addressed?
> 
> For tracing programs, I think we will need an allow list where
> task local storage can be used.
Instead of a whitelist, can refcount_inc_not_zero(&tsk->usage) be used?
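
For illustration, a rough sketch of that idea (hypothetical and untested;
the parameter names are assumed and the storage-creation step is elided):

	/* Hypothetical sketch, not a tested patch: only create new task
	 * storage if the task is still referenced, so a task that has
	 * already reached __put_task_struct() cannot get storage attached
	 * after bpf_task_storage_free() has run.
	 */
	if (flags & BPF_LOCAL_STORAGE_GET_F_CREATE) {
		if (!refcount_inc_not_zero(&task->usage))
			return (unsigned long)NULL;

		/* ... create the storage as the helper does today ... */

		put_task_struct(task);
	}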


Re: [PATCH bpf] bpf: don't leak memory in bpf getsockopt when optlen == 0

2021-01-11 Thread Martin KaFai Lau
On Mon, Jan 11, 2021 at 11:47:38AM -0800, Stanislav Fomichev wrote:
> optlen == 0 indicates that the kernel should ignore BPF buffer
> and use the original one from the user. We, however, forget
> to free the temporary buffer that we've allocated for BPF.
> 
> Reported-by: Martin KaFai Lau 
> Fixes: d8fe449a9c51 ("bpf: Don't return EINVAL from {get,set}sockopt when 
> optlen > PAGE_SIZE")
> Signed-off-by: Stanislav Fomichev 
> ---
>  kernel/bpf/cgroup.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 6ec088a96302..09179ab72c03 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -1395,7 +1395,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, 
> int *level,
>   }
>  
>  out:
> - if (ret)
> + if (*kernel_optval == NULL)
It seems fragile to depend on the caller to init *kernel_optval to NULL.

How about something like:

diff --git i/kernel/bpf/cgroup.c w/kernel/bpf/cgroup.c
index 6ec088a96302..8d94c004e781 100644
--- i/kernel/bpf/cgroup.c
+++ w/kernel/bpf/cgroup.c
@@ -1358,7 +1358,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, 
int *level,
 
if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != 0) {
ret = -EFAULT;
-   goto out;
+   goto err_out;
}
 
lock_sock(sk);
@@ -1368,7 +1368,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, 
int *level,
 
if (!ret) {
ret = -EPERM;
-   goto out;
+   goto err_out;
}
 
if (ctx.optlen == -1) {
@@ -1379,7 +1379,6 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, 
int *level,
ret = -EFAULT;
} else {
/* optlen within bounds, run kernel handler */
-   ret = 0;
 
/* export any potential modifications */
*level = ctx.level;
@@ -1391,12 +1390,15 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, 
int *level,
if (ctx.optlen != 0) {
*optlen = ctx.optlen;
*kernel_optval = ctx.optval;
+   } else {
+   sockopt_free_buf(&ctx);
}
+
+   return 0;
}
 
-out:
-   if (ret)
-   sockopt_free_buf(&ctx);
+err_out:
+   sockopt_free_buf(&ctx);
return ret;
 }


Re: [PATCH bpf] bpf: don't leak memory in bpf getsockopt when optlen == 0

2021-01-11 Thread Martin KaFai Lau
On Mon, Jan 11, 2021 at 02:38:02PM -0800, Stanislav Fomichev wrote:
> On Mon, Jan 11, 2021 at 2:32 PM Martin KaFai Lau  wrote:
> >
> > On Mon, Jan 11, 2021 at 11:47:38AM -0800, Stanislav Fomichev wrote:
> > > optlen == 0 indicates that the kernel should ignore BPF buffer
> > > and use the original one from the user. We, however, forget
> > > to free the temporary buffer that we've allocated for BPF.
> > >
> > > Reported-by: Martin KaFai Lau 
> > > Fixes: d8fe449a9c51 ("bpf: Don't return EINVAL from {get,set}sockopt when 
> > > optlen > PAGE_SIZE")
> > > Signed-off-by: Stanislav Fomichev 
> > > ---
> > >  kernel/bpf/cgroup.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > > index 6ec088a96302..09179ab72c03 100644
> > > --- a/kernel/bpf/cgroup.c
> > > +++ b/kernel/bpf/cgroup.c
> > > @@ -1395,7 +1395,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock 
> > > *sk, int *level,
> > >   }
> > >
> > >  out:
> > > - if (ret)
> > > + if (*kernel_optval == NULL)
> > It seems fragile to depend on the caller to init *kernel_optval to NULL.
> We can manually reset it to NULL when we enter
> __cgroup_bpf_run_filter_setsockopt,
It feels weird to reset the caller's value at the beginning when this is not
intended to be an _init()-like function, so I avoided it.

But yeah, I am fine with this approach too and won't oppose it strongly ;)

> I didn't bother since there is only one existing caller.
> 
> But your patch also LGTM, I don't really have a preference.
> 
> > How about something like:
> >
> > diff --git i/kernel/bpf/cgroup.c w/kernel/bpf/cgroup.c
> > index 6ec088a96302..8d94c004e781 100644
> > --- i/kernel/bpf/cgroup.c
> > +++ w/kernel/bpf/cgroup.c
> > @@ -1358,7 +1358,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock 
> > *sk, int *level,
> >
> > if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != 
> > 0) {
> > ret = -EFAULT;
> > -   goto out;
> > +   goto err_out;
> > }
> >
> > lock_sock(sk);
> > @@ -1368,7 +1368,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock 
> > *sk, int *level,
> >
> > if (!ret) {
> > ret = -EPERM;
> > -   goto out;
> > +   goto err_out;
> > }
> >
> > if (ctx.optlen == -1) {
> > @@ -1379,7 +1379,6 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock 
> > *sk, int *level,
> > ret = -EFAULT;
> > } else {
> > /* optlen within bounds, run kernel handler */
> > -   ret = 0;
> >
> > /* export any potential modifications */
> > *level = ctx.level;
> > @@ -1391,12 +1390,15 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock 
> > *sk, int *level,
> > if (ctx.optlen != 0) {
> > *optlen = ctx.optlen;
> > *kernel_optval = ctx.optval;
> > +   } else {
> > +   sockopt_free_buf(&ctx);
> > }
> > +
> > +   return 0;
> > }
> >
> > -out:
> > -   if (ret)
> > -   sockopt_free_buf(&ctx);
> > +err_out:
> > +   sockopt_free_buf(&ctx);
> > return ret;
> >  }


Re: [PATCH bpf v2] bpf: don't leak memory in bpf getsockopt when optlen == 0

2021-01-12 Thread Martin KaFai Lau
On Tue, Jan 12, 2021 at 08:28:29AM -0800, Stanislav Fomichev wrote:
> optlen == 0 indicates that the kernel should ignore BPF buffer
> and use the original one from the user. We, however, forget
> to free the temporary buffer that we've allocated for BPF.
Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next 1/4] bpf: enable task local storage for tracing programs

2021-01-12 Thread Martin KaFai Lau
On Mon, Jan 11, 2021 at 03:41:26PM -0800, Song Liu wrote:
> 
> 
> > On Jan 11, 2021, at 10:56 AM, Martin Lau  wrote:
> > 
> > On Fri, Jan 08, 2021 at 03:19:47PM -0800, Song Liu wrote:
> > 
> > [ ... ]
> > 
> >> diff --git a/kernel/bpf/bpf_local_storage.c 
> >> b/kernel/bpf/bpf_local_storage.c
> >> index dd5aedee99e73..9bd47ad2b26f1 100644
> >> --- a/kernel/bpf/bpf_local_storage.c
> >> +++ b/kernel/bpf/bpf_local_storage.c
> >> @@ -140,17 +140,18 @@ static void __bpf_selem_unlink_storage(struct 
> >> bpf_local_storage_elem *selem)
> >> {
> >>struct bpf_local_storage *local_storage;
> >>bool free_local_storage = false;
> >> +  unsigned long flags;
> >> 
> >>if (unlikely(!selem_linked_to_storage(selem)))
> >>/* selem has already been unlinked from sk */
> >>return;
> >> 
> >>local_storage = rcu_dereference(selem->local_storage);
> >> -  raw_spin_lock_bh(&local_storage->lock);
> >> +  raw_spin_lock_irqsave(&local_storage->lock, flags);
> > It would be useful to have a few words in the commit message on this change
> > for future reference.
> > 
> > Please also remove the in_irq() check from bpf_sk_storage.c
> > to avoid confusion in the future.  It probably should
> > be in a separate patch.
> 
> Do you mean we allow bpf_sk_storage_get_tracing() and 
> bpf_sk_storage_delete_tracing() in irq context? Like
Right.

However, on second thought, maybe let's skip that for now
until a use case comes up and a test can be written.


[PATCH bpf 0/3] bpf: Enforce NULL check on new _OR_NULL return types

2020-10-19 Thread Martin KaFai Lau
This set enforces the NULL check on the new helper return types,
RET_PTR_TO_BTF_ID_OR_NULL and RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL.

Martin KaFai Lau (3):
  bpf: Enforce id generation for all may-be-null register type
  bpf: selftest: Ensure the return value of bpf_skc_to helpers must be
checked
  bpf: selftest: Ensure the return value of the bpf_per_cpu_ptr() must
be checked

 kernel/bpf/verifier.c | 11 ++--
 .../selftests/bpf/prog_tests/ksyms_btf.c  | 57 +--
 .../bpf/progs/test_ksyms_btf_null_check.c | 31 ++
 tools/testing/selftests/bpf/verifier/sock.c   | 25 
 4 files changed, 100 insertions(+), 24 deletions(-)
 create mode 100644 
tools/testing/selftests/bpf/progs/test_ksyms_btf_null_check.c

-- 
2.24.1



[PATCH bpf 2/3] bpf: selftest: Ensure the return value of bpf_skc_to helpers must be checked

2020-10-19 Thread Martin KaFai Lau
This patch tests:

int bpf_cls(struct __sk_buff *skb)
{
/* REG_6: sk
 * REG_7: tp
 * REG_8: req_sk
 */

sk = skb->sk;
if (!sk)
return 0;

tp = bpf_skc_to_tcp_sock(sk);
req_sk = bpf_skc_to_tcp_request_sock(sk);
if (!req_sk)
return 0;

/* !tp has not been tested, so verifier should reject. */
return *(__u8 *)tp;
}

Signed-off-by: Martin KaFai Lau 
---
 tools/testing/selftests/bpf/verifier/sock.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/tools/testing/selftests/bpf/verifier/sock.c 
b/tools/testing/selftests/bpf/verifier/sock.c
index b1aac2641498..ce13ece08d51 100644
--- a/tools/testing/selftests/bpf/verifier/sock.c
+++ b/tools/testing/selftests/bpf/verifier/sock.c
@@ -631,3 +631,28 @@
.prog_type = BPF_PROG_TYPE_SK_REUSEPORT,
.result = ACCEPT,
 },
+{
+   "mark null check on return value of bpf_skc_to helpers",
+   .insns = {
+   BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, 
sk)),
+   BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+   BPF_EMIT_CALL(BPF_FUNC_skc_to_tcp_sock),
+   BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+   BPF_EMIT_CALL(BPF_FUNC_skc_to_tcp_request_sock),
+   BPF_MOV64_REG(BPF_REG_8, BPF_REG_0),
+   BPF_JMP_IMM(BPF_JNE, BPF_REG_8, 0, 2),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_7, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .result = REJECT,
+   .errstr = "invalid mem access",
+   .result_unpriv = REJECT,
+   .errstr_unpriv = "unknown func",
+},
-- 
2.24.1



[PATCH bpf 3/3] bpf: selftest: Ensure the return value of the bpf_per_cpu_ptr() must be checked

2020-10-19 Thread Martin KaFai Lau
This patch tests that all pointers returned by bpf_per_cpu_ptr() must be
checked for NULL before they can be accessed.

Since this patch adds a subtest "null_check", it moves the ".data..percpu"
existence check to the very beginning, before running any subtest.

Signed-off-by: Martin KaFai Lau 
---
 .../selftests/bpf/prog_tests/ksyms_btf.c  | 57 +--
 .../bpf/progs/test_ksyms_btf_null_check.c | 31 ++
 2 files changed, 70 insertions(+), 18 deletions(-)
 create mode 100644 
tools/testing/selftests/bpf/progs/test_ksyms_btf_null_check.c

diff --git a/tools/testing/selftests/bpf/prog_tests/ksyms_btf.c 
b/tools/testing/selftests/bpf/prog_tests/ksyms_btf.c
index 28e26bd3e0ca..b58b775d19f3 100644
--- a/tools/testing/selftests/bpf/prog_tests/ksyms_btf.c
+++ b/tools/testing/selftests/bpf/prog_tests/ksyms_btf.c
@@ -5,18 +5,17 @@
 #include 
 #include 
 #include "test_ksyms_btf.skel.h"
+#include "test_ksyms_btf_null_check.skel.h"
 
 static int duration;
 
-void test_ksyms_btf(void)
+static void test_basic(void)
 {
__u64 runqueues_addr, bpf_prog_active_addr;
__u32 this_rq_cpu;
int this_bpf_prog_active;
struct test_ksyms_btf *skel = NULL;
struct test_ksyms_btf__data *data;
-   struct btf *btf;
-   int percpu_datasec;
int err;
 
err = kallsyms_find("runqueues", &runqueues_addr);
@@ -31,20 +30,6 @@ void test_ksyms_btf(void)
if (CHECK(err == -ENOENT, "ksym_find", "symbol 'bpf_prog_active' not 
found\n"))
return;
 
-   btf = libbpf_find_kernel_btf();
-   if (CHECK(IS_ERR(btf), "btf_exists", "failed to load kernel BTF: %ld\n",
- PTR_ERR(btf)))
-   return;
-
-   percpu_datasec = btf__find_by_name_kind(btf, ".data..percpu",
-   BTF_KIND_DATASEC);
-   if (percpu_datasec < 0) {
-   printf("%s:SKIP:no PERCPU DATASEC in kernel btf\n",
-  __func__);
-   test__skip();
-   goto cleanup;
-   }
-
skel = test_ksyms_btf__open_and_load();
if (CHECK(!skel, "skel_open", "failed to open and load skeleton\n"))
goto cleanup;
@@ -83,6 +68,42 @@ void test_ksyms_btf(void)
  data->out__bpf_prog_active);
 
 cleanup:
-   btf__free(btf);
test_ksyms_btf__destroy(skel);
 }
+
+static void test_null_check(void)
+{
+   struct test_ksyms_btf_null_check *skel;
+
+   skel = test_ksyms_btf_null_check__open_and_load();
+   CHECK(skel, "skel_open", "unexpected load of a prog missing null 
check\n");
+
+   test_ksyms_btf_null_check__destroy(skel);
+}
+
+void test_ksyms_btf(void)
+{
+   int percpu_datasec;
+   struct btf *btf;
+
+   btf = libbpf_find_kernel_btf();
+   if (CHECK(IS_ERR(btf), "btf_exists", "failed to load kernel BTF: %ld\n",
+ PTR_ERR(btf)))
+   return;
+
+   percpu_datasec = btf__find_by_name_kind(btf, ".data..percpu",
+   BTF_KIND_DATASEC);
+   btf__free(btf);
+   if (percpu_datasec < 0) {
+   printf("%s:SKIP:no PERCPU DATASEC in kernel btf\n",
+  __func__);
+   test__skip();
+   return;
+   }
+
+   if (test__start_subtest("basic"))
+   test_basic();
+
+   if (test__start_subtest("null_check"))
+   test_null_check();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_ksyms_btf_null_check.c 
b/tools/testing/selftests/bpf/progs/test_ksyms_btf_null_check.c
new file mode 100644
index ..8bc8f7c637bc
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_ksyms_btf_null_check.c
@@ -0,0 +1,31 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+
+#include "vmlinux.h"
+
+#include 
+
+extern const struct rq runqueues __ksym; /* struct type global var. */
+extern const int bpf_prog_active __ksym; /* int type global var. */
+
+SEC("raw_tp/sys_enter")
+int handler(const void *ctx)
+{
+   struct rq *rq;
+   int *active;
+   __u32 cpu;
+
+   cpu = bpf_get_smp_processor_id();
+   rq = (struct rq *)bpf_per_cpu_ptr(&runqueues, cpu);
+   active = (int *)bpf_per_cpu_ptr(&bpf_prog_active, cpu);
+   if (active) {
+   /* READ_ONCE */
+   *(volatile int *)active;
+   /* !rq has not been tested, so verifier should reject. */
+   *(volatile int *)(&rq->cpu);
+   }
+
+   return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.24.1



[PATCH bpf 1/3] bpf: Enforce id generation for all may-be-null register type

2020-10-19 Thread Martin KaFai Lau
The commit af7ec1383361 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
introduces RET_PTR_TO_BTF_ID_OR_NULL and
the commit eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
introduces RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL.
Note that for RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL, the reg0->type
could become PTR_TO_MEM_OR_NULL which is not covered by
BPF_PROBE_MEM.

The BPF_REG_0 will then hold a _OR_NULL pointer type. This _OR_NULL
pointer type requires the bpf program to explicitly do a NULL check first.
After NULL check, the verifier will mark all registers having
the same reg->id as safe to use.  However, the reg->id
is not set for those new _OR_NULL return types.  One way
this can go wrong is that checking NULL for one btf_id typed pointer will
end up validating all other btf_id typed pointers, because
all of them have id == 0.  The later tests will exercise
this path.

To fix it and also avoid similar issue in the future, this patch
moves the id generation logic out of each individual RET type
test in check_helper_call().  Instead, it does one
reg_type_may_be_null() test and then do the id generation
if needed.

This patch also adds a WARN_ON_ONCE in mark_ptr_or_null_reg()
to catch future breakage.

The _OR_NULL pointer usage in the bpf_iter_reg.ctx_arg_info is
fine because it just happens that the existing id generation after
check_ctx_access() has covered it.  It is also using the
reg_type_may_be_null() to decide if id generation is needed or not.

Fixes: af7ec1383361 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
Cc: Yonghong Song 
Cc: Hao Luo 
Signed-off-by: Martin KaFai Lau 
---
 kernel/bpf/verifier.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 39d7f44e7c92..6200519582a6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5133,24 +5133,19 @@ static int check_helper_call(struct bpf_verifier_env 
*env, int func_id, int insn
regs[BPF_REG_0].id = ++env->id_gen;
} else {
regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL;
-   regs[BPF_REG_0].id = ++env->id_gen;
}
} else if (fn->ret_type == RET_PTR_TO_SOCKET_OR_NULL) {
mark_reg_known_zero(env, regs, BPF_REG_0);
regs[BPF_REG_0].type = PTR_TO_SOCKET_OR_NULL;
-   regs[BPF_REG_0].id = ++env->id_gen;
} else if (fn->ret_type == RET_PTR_TO_SOCK_COMMON_OR_NULL) {
mark_reg_known_zero(env, regs, BPF_REG_0);
regs[BPF_REG_0].type = PTR_TO_SOCK_COMMON_OR_NULL;
-   regs[BPF_REG_0].id = ++env->id_gen;
} else if (fn->ret_type == RET_PTR_TO_TCP_SOCK_OR_NULL) {
mark_reg_known_zero(env, regs, BPF_REG_0);
regs[BPF_REG_0].type = PTR_TO_TCP_SOCK_OR_NULL;
-   regs[BPF_REG_0].id = ++env->id_gen;
} else if (fn->ret_type == RET_PTR_TO_ALLOC_MEM_OR_NULL) {
mark_reg_known_zero(env, regs, BPF_REG_0);
regs[BPF_REG_0].type = PTR_TO_MEM_OR_NULL;
-   regs[BPF_REG_0].id = ++env->id_gen;
regs[BPF_REG_0].mem_size = meta.mem_size;
} else if (fn->ret_type == RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL ||
   fn->ret_type == RET_PTR_TO_MEM_OR_BTF_ID) {
@@ -5199,6 +5194,9 @@ static int check_helper_call(struct bpf_verifier_env 
*env, int func_id, int insn
return -EINVAL;
}
 
+   if (reg_type_may_be_null(regs[BPF_REG_0].type))
+   regs[BPF_REG_0].id = ++env->id_gen;
+
if (is_ptr_cast_function(func_id)) {
/* For release_reference() */
regs[BPF_REG_0].ref_obj_id = meta.ref_obj_id;
@@ -7212,7 +7210,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state 
*state,
 struct bpf_reg_state *reg, u32 id,
 bool is_null)
 {
-   if (reg_type_may_be_null(reg->type) && reg->id == id) {
+   if (reg_type_may_be_null(reg->type) && reg->id == id &&
+   !WARN_ON_ONCE(!reg->id)) {
/* Old offset (both fixed and variable parts) should
 * have been known-zero, because we don't allow pointer
 * arithmetic on pointers that might be NULL.
-- 
2.24.1



Re: [bpf-next PATCH 2/4] selftests/bpf: Drop python client/server in favor of threads

2020-10-28 Thread Martin KaFai Lau
On Tue, Oct 27, 2020 at 06:47:13PM -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Drop the tcp_client/server.py files in favor of using a client and server
> thread within the test case. Specifically we spawn a new thread to play the
> role of the server, and the main testing thread plays the role of client.
> 
> Doing this we are able to reduce overhead since we don't have two python
> workers possibly floating around. In addition we don't have to worry about
> synchronization issues and as such the retry loop waiting for the threads
> to close the sockets can be dropped as we will have already closed the
> sockets in the local executable and synchronized the server thread.
Thanks for working on this.

> 
> Signed-off-by: Alexander Duyck 
> ---
>  .../testing/selftests/bpf/prog_tests/tcpbpf_user.c |  125 
> +---
>  tools/testing/selftests/bpf/tcp_client.py  |   50 
>  tools/testing/selftests/bpf/tcp_server.py  |   80 -
>  3 files changed, 107 insertions(+), 148 deletions(-)
>  delete mode 100755 tools/testing/selftests/bpf/tcp_client.py
>  delete mode 100755 tools/testing/selftests/bpf/tcp_server.py
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c 
> b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> index 5becab8b04e3..71ab82e37eb7 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> @@ -1,14 +1,65 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include 
>  #include 
> +#include 
>  
>  #include "test_tcpbpf.h"
>  #include "cgroup_helpers.h"
>  
> +#define LO_ADDR6 "::1"
>  #define CG_NAME "/tcpbpf-user-test"
>  
> -/* 3 comes from one listening socket + both ends of the connection */
> -#define EXPECTED_CLOSE_EVENTS3
> +static pthread_mutex_t server_started_mtx = PTHREAD_MUTEX_INITIALIZER;
> +static pthread_cond_t server_started = PTHREAD_COND_INITIALIZER;
> +
> +static void *server_thread(void *arg)
> +{
> + struct sockaddr_storage addr;
> + socklen_t len = sizeof(addr);
> + int fd = *(int *)arg;
> + char buf[1000];
> + int client_fd;
> + int err = 0;
> + int i;
> +
> + err = listen(fd, 1);
This is not needed.  start_server() has done it.

> +
> + pthread_mutex_lock(&server_started_mtx);
> + pthread_cond_signal(&server_started);
> + pthread_mutex_unlock(&server_started_mtx);
> +
> + if (err < 0) {
> + perror("Failed to listen on socket");
> + err = errno;
> + goto err;
> + }
> +
> + client_fd = accept(fd, (struct sockaddr *)&addr, &len);
> + if (client_fd < 0) {
> + perror("Failed to accept client");
> + err = errno;
> + goto err;
> + }
> +
> + if (recv(client_fd, buf, 1000, 0) < 1000) {
> + perror("failed/partial recv");
> + err = errno;
> + goto out_clean;
> + }
> +
> + for (i = 0; i < 500; i++)
> + buf[i] = '.';
> +
> + if (send(client_fd, buf, 500, 0) < 500) {
> + perror("failed/partial send");
> + err = errno;
> + goto out_clean;
> + }
> +out_clean:
> + close(client_fd);
> +err:
> + return (void *)(long)err;
> +}
>  
>  #define EXPECT_EQ(expected, actual, fmt) \
>   do {\
> @@ -43,7 +94,9 @@ int verify_result(const struct tcpbpf_globals *result)
>   EXPECT_EQ(0x80, result->bad_cb_test_rv, PRIu32);
>   EXPECT_EQ(0, result->good_cb_test_rv, PRIu32);
>   EXPECT_EQ(1, result->num_listen, PRIu32);
> - EXPECT_EQ(EXPECTED_CLOSE_EVENTS, result->num_close_events, PRIu32);
> +
> + /* 3 comes from one listening socket + both ends of the connection */
> + EXPECT_EQ(3, result->num_close_events, PRIu32);
>  
>   return ret;
>  }
> @@ -67,6 +120,52 @@ int verify_sockopt_result(int sock_map_fd)
>   return ret;
>  }
>  
> +static int run_test(void)
> +{
> + int server_fd, client_fd;
> + void *server_err;
> + char buf[1000];
> + pthread_t tid;
> + int err = -1;
> + int i;
> +
> + server_fd = start_server(AF_INET6, SOCK_STREAM, LO_ADDR6, 0, 0);
> + if (CHECK_FAIL(server_fd < 0))
> + return err;
> +
> + pthread_mutex_lock(&server_started_mtx);
> + if (CHECK_FAIL(pthread_create(&tid, NULL, server_thread,
> +   (void *)&server_fd)))
> + goto close_server_fd;
> +
> + pthread_cond_wait(&server_started, &server_started_mtx);
> + pthread_mutex_unlock(&server_started_mtx);
> +
> + client_fd = connect_to_fd(server_fd, 0);
> + if (client_fd < 0)
> + goto close_server_fd;
> +
> + for (i = 0; i < 1000; i++)
> + buf[i] = '+';
> +
> + if (CHECK_FAIL(send(client_fd, buf, 1000, 0) < 1000))
> + goto close_client_fd;
> +
> + if (CHECK_FAIL(recv(c

Re: [bpf-next PATCH 2/4] selftests/bpf: Drop python client/server in favor of threads

2020-10-29 Thread Martin KaFai Lau
On Thu, Oct 29, 2020 at 09:58:15AM -0700, Alexander Duyck wrote:
[ ... ]

> > > @@ -43,7 +94,9 @@ int verify_result(const struct tcpbpf_globals *result)
> > >   EXPECT_EQ(0x80, result->bad_cb_test_rv, PRIu32);
> > >   EXPECT_EQ(0, result->good_cb_test_rv, PRIu32);
> > >   EXPECT_EQ(1, result->num_listen, PRIu32);
> > > - EXPECT_EQ(EXPECTED_CLOSE_EVENTS, result->num_close_events, PRIu32);
> > > +
> > > + /* 3 comes from one listening socket + both ends of the connection 
> > > */
> > > + EXPECT_EQ(3, result->num_close_events, PRIu32);
> > >
> > >   return ret;
> > >  }
> > > @@ -67,6 +120,52 @@ int verify_sockopt_result(int sock_map_fd)
> > >   return ret;
> > >  }
> > >
> > > +static int run_test(void)
> > > +{
> > > + int server_fd, client_fd;
> > > + void *server_err;
> > > + char buf[1000];
> > > + pthread_t tid;
> > > + int err = -1;
> > > + int i;
> > > +
> > > + server_fd = start_server(AF_INET6, SOCK_STREAM, LO_ADDR6, 0, 0);
> > > + if (CHECK_FAIL(server_fd < 0))
> > > + return err;
> > > +
> > > + pthread_mutex_lock(&server_started_mtx);
> > > + if (CHECK_FAIL(pthread_create(&tid, NULL, server_thread,
> > > +   (void *)&server_fd)))
> > > + goto close_server_fd;
> > > +
> > > + pthread_cond_wait(&server_started, &server_started_mtx);
> > > + pthread_mutex_unlock(&server_started_mtx);
> > > +
> > > + client_fd = connect_to_fd(server_fd, 0);
> > > + if (client_fd < 0)
> > > + goto close_server_fd;
> > > +
> > > + for (i = 0; i < 1000; i++)
> > > + buf[i] = '+';
> > > +
> > > + if (CHECK_FAIL(send(client_fd, buf, 1000, 0) < 1000))
> > > + goto close_client_fd;
> > > +
> > > + if (CHECK_FAIL(recv(client_fd, buf, 500, 0) < 500))
> > > + goto close_client_fd;
> > > +
> > > + pthread_join(tid, &server_err);
> > I think this can be further simplified without starting a thread,
> > doing everything in run_test() instead.
> >
> > Something like this (uncompiled code):
> >
> > accept_fd = accept(server_fd, NULL, 0);
> > send(client_fd, plus_buf, 1000, 0);
> > recv(accept_fd, recv_buf, 1000, 0);
> > send(accept_fd, dot_buf, 500, 0);
> > recv(client_fd, recv_buf, 500, 0);
> 
> I can take a look at switching it over.
> 
> > > +
> > > + err = (int)(long)server_err;
> > > + CHECK_FAIL(err);
> > > +
> > > +close_client_fd:
> > > + close(client_fd);
> > > +close_server_fd:
> > > + close(server_fd);
> > > + return err;
> > > +}
> > > +
> > >  void test_tcpbpf_user(void)
> > >  {
> > >   const char *file = "test_tcpbpf_kern.o";
> > > @@ -74,7 +173,6 @@ void test_tcpbpf_user(void)
> > >   struct tcpbpf_globals g = {0};
> > >   struct bpf_object *obj;
> > >   int cg_fd = -1;
> > > - int retry = 10;
> > >   __u32 key = 0;
> > >   int rv;
> > >
> > > @@ -94,11 +192,6 @@ void test_tcpbpf_user(void)
> > >   goto err;
> > >   }
> > >
> > > - if (CHECK_FAIL(system("./tcp_server.py"))) {
> > > - fprintf(stderr, "FAILED: TCP server\n");
> > > - goto err;
> > > - }
> > > -
> > >   map_fd = bpf_find_map(__func__, obj, "global_map");
> > >   if (CHECK_FAIL(map_fd < 0))
> > >   goto err;
> > > @@ -107,21 +200,17 @@ void test_tcpbpf_user(void)
> > >   if (CHECK_FAIL(sock_map_fd < 0))
> > >   goto err;
> > >
> > > -retry_lookup:
> > > + if (run_test()) {
> > > + fprintf(stderr, "FAILED: TCP server\n");
> > > + goto err;
> > > + }
> > > +
> > >   rv = bpf_map_lookup_elem(map_fd, &key, &g);
> > >   if (CHECK_FAIL(rv != 0)) {
> > CHECK() is better here if it needs to output an error message.
> > The same goes for similar usages in this patch set.
> >
> > For the start_server() above, which has already logged the error message,
> > CHECK_FAIL() is good enough.
> >
> > >   fprintf(stderr, "FAILED: bpf_map_lookup_elem returns %d\n", 
> > > rv);
> > >   goto err;
> > >   }
> > >
> > > - if (g.num_close_events != EXPECTED_CLOSE_EVENTS && retry--) {
> > It is good to have a solution that avoids the test depending on some number
> > of retries.
> >
> > After looking at BPF_SOCK_OPS_STATE_CB in test_tcpbpf_kern.c,
> > it is not clear to me that removing python alone is enough to avoid the
> > race (so the retry--).  One of the sk might still be in TCP_LAST_ACK
> > instead of TCP_CLOSE.
> >
> 
> After you pointed this out I decided to go back through and do some
> further testing. After testing this for several thousand iterations, it
> does look like the issue can still happen; it was just significantly
> less frequent with the threaded approach, but it was still there. So I
> will go back through and add this back and then fold it into the
> verify_results function in the third patch. Although I might reduce
> 

Re: [bpf-next PATCH v2 2/5] selftests/bpf: Drop python client/server in favor of threads

2020-11-02 Thread Martin KaFai Lau
On Sat, Oct 31, 2020 at 11:52:18AM -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Drop the tcp_client/server.py files in favor of using a client and server
> thread within the test case. Specifically we spawn a new thread to play the
The thread comment may be outdated in v2.

> role of the server, and the main testing thread plays the role of client.
> 
> Add logic to the end of the run_test function to guarantee that the sockets
> are closed when we begin verifying results.
> 
> Doing this we are able to reduce overhead since we don't have two python
> workers possibly floating around. In addition we don't have to worry about
> synchronization issues and as such the retry loop waiting for the threads
> to close the sockets can be dropped as we will have already closed the
> sockets in the local executable and synchronized the server thread.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  .../testing/selftests/bpf/prog_tests/tcpbpf_user.c |   96 
> 
>  tools/testing/selftests/bpf/tcp_client.py  |   50 --
>  tools/testing/selftests/bpf/tcp_server.py  |   80 -
>  3 files changed, 78 insertions(+), 148 deletions(-)
>  delete mode 100755 tools/testing/selftests/bpf/tcp_client.py
>  delete mode 100755 tools/testing/selftests/bpf/tcp_server.py
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c 
> b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> index 54f1dce97729..17d4299435df 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> @@ -1,13 +1,14 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include 
>  #include 
> +#include 
>  
>  #include "test_tcpbpf.h"
>  
> +#define LO_ADDR6 "::1"
>  #define CG_NAME "/tcpbpf-user-test"
>  
> -/* 3 comes from one listening socket + both ends of the connection */
> -#define EXPECTED_CLOSE_EVENTS3
> +static __u32 duration;
>  
>  #define EXPECT_EQ(expected, actual, fmt) \
>   do {\
> @@ -42,7 +43,9 @@ int verify_result(const struct tcpbpf_globals *result)
>   EXPECT_EQ(0x80, result->bad_cb_test_rv, PRIu32);
>   EXPECT_EQ(0, result->good_cb_test_rv, PRIu32);
>   EXPECT_EQ(1, result->num_listen, PRIu32);
> - EXPECT_EQ(EXPECTED_CLOSE_EVENTS, result->num_close_events, PRIu32);
> +
> + /* 3 comes from one listening socket + both ends of the connection */
> + EXPECT_EQ(3, result->num_close_events, PRIu32);
>  
>   return ret;
>  }
> @@ -66,6 +69,75 @@ int verify_sockopt_result(int sock_map_fd)
>   return ret;
>  }
>  
> +static int run_test(void)
> +{
> + int listen_fd = -1, cli_fd = -1, accept_fd = -1;
> + char buf[1000];
> + int err = -1;
> + int i;
> +
> + listen_fd = start_server(AF_INET6, SOCK_STREAM, LO_ADDR6, 0, 0);
> + if (CHECK(listen_fd == -1, "start_server", "listen_fd:%d errno:%d\n",
> +   listen_fd, errno))
> + goto done;
> +
> + cli_fd = connect_to_fd(listen_fd, 0);
> + if (CHECK(cli_fd == -1, "connect_to_fd(listen_fd)",
> +   "cli_fd:%d errno:%d\n", cli_fd, errno))
> + goto done;
> +
> + accept_fd = accept(listen_fd, NULL, NULL);
> + if (CHECK(accept_fd == -1, "accept(listen_fd)",
> +   "accept_fd:%d errno:%d\n", accept_fd, errno))
> + goto done;
> +
> + /* Send 1000B of '+'s from cli_fd -> accept_fd */
> + for (i = 0; i < 1000; i++)
> + buf[i] = '+';
> +
> + err = send(cli_fd, buf, 1000, 0);
> + if (CHECK(err != 1000, "send(cli_fd)", "err:%d errno:%d\n", err, errno))
> + goto done;
> +
> + err = recv(accept_fd, buf, 1000, 0);
> + if (CHECK(err != 1000, "recv(accept_fd)", "err:%d errno:%d\n", err, 
> errno))
> + goto done;
> +
> + /* Send 500B of '.'s from accept_fd ->cli_fd */
> + for (i = 0; i < 500; i++)
> + buf[i] = '.';
> +
> + err = send(accept_fd, buf, 500, 0);
> + if (CHECK(err != 500, "send(accept_fd)", "err:%d errno:%d\n", err, 
> errno))
> + goto done;
> +
> + err = recv(cli_fd, buf, 500, 0);
Unlikely, but err from the above send()/recv() could be 0.
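
For what it is worth, the checks could be made robust against short
transfers with small wrappers like the (made-up, untested) helpers below,
so a short or zero return from a single call does not fail the test
spuriously:

#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helpers, not part of the selftest: loop until the whole
 * buffer has been transferred.
 */
static ssize_t send_all(int fd, const char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		ssize_t n = send(fd, buf + off, len - off, 0);

		if (n <= 0)
			return n;	/* error, or nothing could be sent */
		off += n;
	}

	return off;
}

static ssize_t recv_all(int fd, char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		ssize_t n = recv(fd, buf + off, len - off, 0);

		if (n < 0)
			return n;	/* error */
		if (n == 0)
			break;		/* peer closed early */
		off += n;
	}

	return off;
}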


> + if (CHECK(err != 500, "recv(cli_fd)", "err:%d errno:%d\n", err, errno))
> + goto done;
> +
> + /*
> +  * shutdown accept first to guarantee correct ordering for
> +  * bytes_received and bytes_acked when we go to verify the results.
> +  */
> + shutdown(accept_fd, SHUT_WR);
> + err = recv(cli_fd, buf, 1, 0);
> + if (CHECK(err, "recv(cli_fd) for fin", "err:%d errno:%d\n", err, errno))
> + goto done;
> +
> + shutdown(cli_fd, SHUT_WR);
> + err = recv(accept_fd, buf, 1, 0);
hmm... I was thinking cli_fd may still be in TCP_LAST_ACK
but we can go with this version first and see if CI could
really hit this case before resurrecting the idea on t

Re: [bpf-next PATCH v2 3/5] selftests/bpf: Replace EXPECT_EQ with ASSERT_EQ and refactor verify_results

2020-11-02 Thread Martin KaFai Lau
On Sat, Oct 31, 2020 at 11:52:24AM -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> There is already logic in test_progs.h for asserting that a value is
> expected to be another value. So instead of reinventing it we should just
> make use of ASSERT_EQ in tcpbpf_user.c. This will allow for better
> debugging and integrates much more closely with the test_progs framework.
> 
> In addition we can refactor the code a bit to merge together the two
> verify functions and tie them together into a single function. Doing this
> helps to clean the code up a bit and makes it more readable as all the
> verification is now done in one function.
> 
> Lastly we can relocate the verification to the end of the run_test since it
> is logically part of the test itself. With this we can drop the need for a
> return value from run_test since verification becomes the last step of the
> call and then immediately following is the tear down of the test setup.
> 
> Signed-off-by: Alexander Duyck 
Acked-by: Martin KaFai Lau 

> ---
>  .../testing/selftests/bpf/prog_tests/tcpbpf_user.c |  114 
> 
>  1 file changed, 44 insertions(+), 70 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c 
> b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> index 17d4299435df..d96f4084d2f5 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> @@ -10,66 +10,58 @@
>  
>  static __u32 duration;
>  
> -#define EXPECT_EQ(expected, actual, fmt) \
> - do {\
> - if ((expected) != (actual)) {   \
> - fprintf(stderr, "  Value of: " #actual "\n" \
> -"Actual: %" fmt "\n" \
> -"  Expected: %" fmt "\n",\
> -(actual), (expected));   \
> - ret--;  \
> - }   \
> - } while (0)
> -
> -int verify_result(const struct tcpbpf_globals *result)
> -{
> - __u32 expected_events;
> - int ret = 0;
> -
> - expected_events = ((1 << BPF_SOCK_OPS_TIMEOUT_INIT) |
> -(1 << BPF_SOCK_OPS_RWND_INIT) |
> -(1 << BPF_SOCK_OPS_TCP_CONNECT_CB) |
> -(1 << BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB) |
> -(1 << BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) |
> -(1 << BPF_SOCK_OPS_NEEDS_ECN) |
> -(1 << BPF_SOCK_OPS_STATE_CB) |
> -(1 << BPF_SOCK_OPS_TCP_LISTEN_CB));
> -
> - EXPECT_EQ(expected_events, result->event_map, "#" PRIx32);
> - EXPECT_EQ(501ULL, result->bytes_received, "llu");
> - EXPECT_EQ(1002ULL, result->bytes_acked, "llu");
> - EXPECT_EQ(1, result->data_segs_in, PRIu32);
> - EXPECT_EQ(1, result->data_segs_out, PRIu32);
> - EXPECT_EQ(0x80, result->bad_cb_test_rv, PRIu32);
> - EXPECT_EQ(0, result->good_cb_test_rv, PRIu32);
> - EXPECT_EQ(1, result->num_listen, PRIu32);
> -
> - /* 3 comes from one listening socket + both ends of the connection */
> - EXPECT_EQ(3, result->num_close_events, PRIu32);
> -
> - return ret;
> -}
> -
> -int verify_sockopt_result(int sock_map_fd)
> +static void verify_result(int map_fd, int sock_map_fd)
>  {
> + __u32 expected_events = ((1 << BPF_SOCK_OPS_TIMEOUT_INIT) |
> +  (1 << BPF_SOCK_OPS_RWND_INIT) |
> +  (1 << BPF_SOCK_OPS_TCP_CONNECT_CB) |
> +  (1 << BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB) |
> +  (1 << BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) |
> +  (1 << BPF_SOCK_OPS_NEEDS_ECN) |
> +  (1 << BPF_SOCK_OPS_STATE_CB) |
> +  (1 << BPF_SOCK_OPS_TCP_LISTEN_CB));
> + struct tcpbpf_globals result = { 0 };
nit. init is not needed.

>   __u32 key = 0;
> - int ret = 0;
>   int res;
>   int rv;
>  
> + rv = bpf_map_lookup_elem(map_fd, &key, &result);
> + if (CHECK(rv, "bpf_map_lookup_elem(map_fd)", "err:%d errno:%d",
> +   rv, errno))
> + return;
> +
> + /* check global map */
> + CHECK(expected_events !

Re: [bpf-next PATCH v2 4/5] selftests/bpf: Migrate tcpbpf_user.c to use BPF skeleton

2020-11-02 Thread Martin KaFai Lau
On Sat, Oct 31, 2020 at 11:52:31AM -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Update tcpbpf_user.c to make use of the BPF skeleton. Doing this we can
> simplify test_tcpbpf_user and reduce the overhead involved in setting up
> the test.
> 
> In addition we can clean up the remaining bits such as the one remaining
> CHECK_FAIL at the end of test_tcpbpf_user so that the function only makes
> use of CHECK as needed.
> 
> Acked-by: Andrii Nakryiko 
> Signed-off-by: Alexander Duyck 
Acked-by: Martin KaFai Lau 

> ---
>  .../testing/selftests/bpf/prog_tests/tcpbpf_user.c |   48 
> 
>  1 file changed, 18 insertions(+), 30 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c 
> b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> index d96f4084d2f5..c7a61b0d616a 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
> @@ -4,6 +4,7 @@
>  #include 
>  
>  #include "test_tcpbpf.h"
> +#include "test_tcpbpf_kern.skel.h"
>  
>  #define LO_ADDR6 "::1"
>  #define CG_NAME "/tcpbpf-user-test"
> @@ -133,44 +134,31 @@ static void run_test(int map_fd, int sock_map_fd)
>  
>  void test_tcpbpf_user(void)
>  {
> - const char *file = "test_tcpbpf_kern.o";
> - int prog_fd, map_fd, sock_map_fd;
> - int error = EXIT_FAILURE;
> - struct bpf_object *obj;
> + struct test_tcpbpf_kern *skel;
> + int map_fd, sock_map_fd;
>   int cg_fd = -1;
> - int rv;
> -
> - cg_fd = test__join_cgroup(CG_NAME);
> - if (cg_fd < 0)
> - goto err;
>  
> - if (bpf_prog_load(file, BPF_PROG_TYPE_SOCK_OPS, &obj, &prog_fd)) {
> - fprintf(stderr, "FAILED: load_bpf_file failed for: %s\n", file);
> - goto err;
> - }
> + skel = test_tcpbpf_kern__open_and_load();
> + if (CHECK(!skel, "open and load skel", "failed"))
> + return;
>  
> - rv = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_SOCK_OPS, 0);
> - if (rv) {
> - fprintf(stderr, "FAILED: bpf_prog_attach: %d (%s)\n",
> -errno, strerror(errno));
> - goto err;
> - }
> + cg_fd = test__join_cgroup(CG_NAME);
> + if (CHECK(cg_fd < 0, "test__join_cgroup(" CG_NAME ")",
> +   "cg_fd:%d errno:%d", cg_fd, errno))
> + goto cleanup_skel;
>  
> - map_fd = bpf_find_map(__func__, obj, "global_map");
> - if (map_fd < 0)
> - goto err;
> + map_fd = bpf_map__fd(skel->maps.global_map);
> + sock_map_fd = bpf_map__fd(skel->maps.sockopt_results);
>  
> - sock_map_fd = bpf_find_map(__func__, obj, "sockopt_results");
> - if (sock_map_fd < 0)
> - goto err;
> + skel->links.bpf_testcb = 
> bpf_program__attach_cgroup(skel->progs.bpf_testcb, cg_fd);
> + if (ASSERT_OK_PTR(skel->links.bpf_testcb, "attach_cgroup(bpf_testcb)"))
> + goto cleanup_namespace;
>  
>   run_test(map_fd, sock_map_fd);
>  
> - error = 0;
> -err:
> - bpf_prog_detach(cg_fd, BPF_CGROUP_SOCK_OPS);
> +cleanup_namespace:
nit.

may be "cleanup_cgroup" instead?

or only have one jump label to handle failure since "cg_fd != -1" has been
tested already.
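
For example, the one-label pattern in a stand-alone form (names are
illustrative only, not the selftest code):

#include <stdlib.h>

/* Single-cleanup-label pattern: later resources start in a known-empty
 * state, so both the failure and success paths can share one label.
 */
int main(void)
{
	char *a = NULL, *b = NULL;
	int err = 1;

	a = malloc(16);
	if (!a)
		goto cleanup;

	b = malloc(16);
	if (!b)
		goto cleanup;

	err = 0;	/* success also falls through to the same cleanup */
cleanup:
	free(b);	/* free(NULL) is a no-op, so one label is enough */
	free(a);
	return err;
}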

>   if (cg_fd != -1)
>   close(cg_fd);
> -
> - CHECK_FAIL(error);
> +cleanup_skel:
> + test_tcpbpf_kern__destroy(skel);
>  }
> 
> 

