Re: [PATCH RFC 0/2] kproxy: Kernel Proxy

2017-06-29 Thread Thomas Graf
On 29 June 2017 at 16:21, Tom Herbert  wrote:
> I think the main part of that discussion was around stream parser
> which is needed for message delineation. For a 1:1 proxy, KCM is
> probably overkill (the whole KCM data path and lock becomes
> superfluous). Also, there's no concept of creating a whole message
> before routing it, in the 1:1 case we should let the message pass
> through once it's cleared by the filter (this is the strparser change
> I referred to). As I mentioned, for L7 load balancing we would want a
> multiplexor probably also M:N, but the structure is different since
> there's still no user facing sockets, they're all TCP for instance.
> IMO, the 1:1 proxy case is compelling to solve in itself...

I see. I was definitely thinking m:n. We should evaluate whether it
makes sense to have a specific 1:1 implementation if we need m:n
anyway. For L7 LB, m:n seems obvious, as a particular L4 connection
may act as a transport for multiple requests bidirectionally. KCM
looks like a good starting point for that.

When I talked about enqueueing entire messages, the main concern is
buffering up the payload after the TLS handshake to the point where a
forwarding decision can be made. I would definitely not advocate
buffering entire messages before starting to forward.


Re: [PATCH RFC 0/2] kproxy: Kernel Proxy

2017-06-29 Thread Thomas Graf
Hi Tom

On 29 June 2017 at 11:27, Tom Herbert  wrote:
> This is raw, minimally tested, and error handling needs work. Posting
> as RFC to get feedback on the design...
>
> Sidecar proxies are becoming quite popular on server as a means to
> perform layer 7 processing on application data as it is sent. Such
> sidecars are used for SSL proxies, application firewalls, and L7
> load balancers. While these proxies provide nice functionality,
> their performance is obviously terrible since all the data needs
> to take an extra hop through userspace.

I really appreciate this work. It would have been nice to at least
attribute me in some way as this is exactly what I presented at
Netconf 2017 [0].

I'm also wondering why this is not built on top of KCM which you
suggested to use when we discussed this.

[0] 
https://docs.google.com/presentation/d/1dwSKSBGpUHD3WO5xxzZWj8awV_-xL-oYhvqQMOBhhtk/edit#slide=id.g203aae413f_0_0


Re: [PATCH iproute2 -net-next] lwt: BPF support for LWT

2016-12-13 Thread Thomas Graf
On 13 December 2016 at 00:41, Stephen Hemminger wrote:
> I went ahead and fixed these.

Thanks for fixing it up Stephen.


Re: [PATCH net-next] bpf: fix state equivalence

2016-12-07 Thread Thomas Graf
On 12/07/16 at 10:57am, Alexei Starovoitov wrote:
> Commits 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
> and 484611357c19 ("bpf: allow access into map value arrays") by themselves
> are correct, but in combination they make state equivalence ignore the 'id' field
> of the register state, which can lead to accepting an invalid program.
> 
> Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
> Fixes: 484611357c19 ("bpf: allow access into map value arrays")
> Signed-off-by: Alexei Starovoitov <a...@kernel.org>
> Acked-by: Daniel Borkmann <dan...@iogearbox.net>

Acked-by: Thomas Graf <tg...@suug.ch>


[PATCH iproute2 net-next] bpf: Fix number of retries when growing log buffer

2016-12-07 Thread Thomas Graf
The log buffer is automatically grown when the verifier output does not
fit into the default buffer size. Until now, the number of growth
attempts was not sufficient to reach the maximum buffer size.

Perform 9 iterations to reach max and let the 10th one fail.

j:0 i:65536     max:16777215
j:1 i:131072    max:16777215
j:2 i:262144    max:16777215
j:3 i:524288    max:16777215
j:4 i:1048576   max:16777215
j:5 i:2097152   max:16777215
j:6 i:4194304   max:16777215
j:7 i:8388608   max:16777215
j:8 i:16777216  max:16777215

Signed-off-by: Thomas Graf <tg...@suug.ch>
Acked-by: Daniel Borkmann <dan...@iogearbox.net>
---
 lib/bpf.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/lib/bpf.c b/lib/bpf.c
index 8a5b84b..f3dace2 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -912,15 +912,18 @@ bpf_dump_error(struct bpf_elf_ctx *ctx, const char *format, ...)
 
 static int bpf_log_realloc(struct bpf_elf_ctx *ctx)
 {
+   const size_t log_max = UINT_MAX >> 8;
size_t log_size = ctx->log_size;
void *ptr;
 
if (!ctx->log) {
log_size = 65536;
-   } else {
+   } else if (log_size < log_max) {
log_size <<= 1;
-   if (log_size > (UINT_MAX >> 8))
-   return -EINVAL;
+   if (log_size > log_max)
+   log_size = log_max;
+   } else {
+   return -EINVAL;
}
 
ptr = realloc(ctx->log, log_size);
@@ -1259,7 +1262,7 @@ retry:
 * log for the user, so enlarge it and re-fail.
 */
if (fd < 0 && (errno == ENOSPC || !ctx->log_size)) {
-   if (tries++ < 6 && !bpf_log_realloc(ctx))
+   if (tries++ < 10 && !bpf_log_realloc(ctx))
goto retry;
 
fprintf(stderr, "Log buffer too small to dump verifier log %zu bytes (%d tries)!\n",
-- 
2.7.4



[PATCH net-next] bpf: add additional verifier tests for BPF_PROG_TYPE_LWT_*

2016-12-05 Thread Thomas Graf
 - direct packet read is allowed for LWT_*
 - direct packet write for LWT_IN/LWT_OUT is prohibited
 - direct packet write for LWT_XMIT is allowed
 - access to skb->tc_classid is prohibited for LWT_*

Signed-off-by: Thomas Graf <tg...@suug.ch>
Acked-by: Daniel Borkmann <dan...@iogearbox.net>
---
 tools/testing/selftests/bpf/test_verifier.c | 134 
 1 file changed, 134 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 5da2e9d..47f70a6 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -2683,6 +2683,140 @@ static struct bpf_test tests[] = {
.errstr_unpriv = "R0 pointer arithmetic prohibited",
.result_unpriv = REJECT,
},
+   {
+   "invalid direct packet write for LWT_IN",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct __sk_buff, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct __sk_buff, data_end)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+   BPF_STX_MEM(BPF_B, BPF_REG_2, BPF_REG_2, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .errstr = "cannot write into packet",
+   .result = REJECT,
+   .prog_type = BPF_PROG_TYPE_LWT_IN,
+   },
+   {
+   "invalid direct packet write for LWT_OUT",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct __sk_buff, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct __sk_buff, data_end)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+   BPF_STX_MEM(BPF_B, BPF_REG_2, BPF_REG_2, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .errstr = "cannot write into packet",
+   .result = REJECT,
+   .prog_type = BPF_PROG_TYPE_LWT_OUT,
+   },
+   {
+   "direct packet write for LWT_XMIT",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct __sk_buff, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct __sk_buff, data_end)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+   BPF_STX_MEM(BPF_B, BPF_REG_2, BPF_REG_2, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_LWT_XMIT,
+   },
+   {
+   "direct packet read for LWT_IN",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct __sk_buff, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct __sk_buff, data_end)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+   BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_LWT_IN,
+   },
+   {
+   "direct packet read for LWT_OUT",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct __sk_buff, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct __sk_buff, data_end)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+   

Re: [PATCH net-next v4 3/4] bpf: BPF for lightweight tunnel infrastructure

2016-12-01 Thread Thomas Graf
On 12/01/16 at 01:08pm, Daniel Borkmann wrote:
> For the verifier change in may_access_direct_pkt_data(), would be
> great if you could later on follow up with a selftest-suite case,
> one where BPF_PROG_TYPE_LWT_IN/OUT prog tries to write and fails,
> and one where BPF_PROG_TYPE_LWT_IN/OUT prog uses pkt data to pass
> to helpers, for example, so that we can keep testing it when future
> changes in that area are made. Thanks.

Good idea, will do.


Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Thomas Graf
On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
> XDP manipulates packets at free will and thus all security guarantees
> are off as well as in any user space solution.
> 
> Secondly user space provides policy, acl, more controlled memory
> protection, restartability and better debugability. If I had multi
> tenant workloads I would definitely put more complex "business/acl"
> logic into user space, so I can make use of LSM and other features to
> especially prevent a network facing service to attack the tenants. If
> stuff gets put into the kernel you run user controlled code in the
> kernel exposing a much bigger attack vector.
> 
> What use case do you see in XDP specifically e.g. for container networking?

DDOS mitigation to protect distributed applications in large clusters.
Relying on a CDN works to protect API gateways and frontends (as long
as they don't throw you out of their network) but offers no protection
beyond that, e.g. against a noisy/hostile neighbour. Doing this at the
server level and letting the mitigation capability scale with the number
of servers is natural and cheap.

> > I agree with you if the LB is a software based appliance in either a
> > dedicated VM or on dedicated baremetal.
> > 
> > The reality is turning out to be different in many cases though, LB
> > needs to be performed not only for north south but east west as well.
> > So even if I would handle LB for traffic entering my datacenter in user
> > space, I will need the same LB for packets from my applications and
> > I definitely don't want to move all of that into user space.
> 
> The open question to me is why is programmability needed here.
> 
> Look at the discussion about ECMP and consistent hashing. It is not very
> easy to actually write this code correctly. Why can't we just put C code
> into the kernel that implements this once and for all and let user space
> update the policies?

Whatever LB logic is put in place in native C code now is unlikely to be
the logic we need in two years. We can't really predict the future. If
we could, networking would have been done long ago and we would all
be working on self-eating ice cream now.

> Load balancers have to deal correctly with ICMP packets, e.g. they even
> have to be duplicated to every ECMP route. This seems to be problematic
> to do in eBPF programs due to looping constructs so you end up with
> complicated user space anyway.

Feel free to implement such complex LBs in user space or natively. It is
not required for the majority of use cases. The most popular LBs for
application load balancing have no notion of ECMP and themselves rely on
ECMP-aware routers for redundancy.


Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Thomas Graf
On 12/01/16 at 10:11am, Florian Westphal wrote:
> Aside from this, XDP, like DPDK, is a kernel bypass.
> You might say 'Its just stack bypass, not a kernel bypass!'.
> But what does that mean exactly?  That packets can still be passed
> onward to normal stack?
> Bypass solutions like netmap can also inject packets back to
> kernel stack again.

I have a fundamental issue with the approach of exporting packets into
user space and reinjecting them: Once the packet leaves the kernel,
any security guarantees are off. I have no control over what is
running in user space and whether whatever listener up there has been
compromised or not. To me, that's a no go, in particular for servers
hosting multi tenant workloads. This is one of the main reasons why
XDP, in particular in combination with BPF, is very interesting to me.

> b). with regards to a programmable data path: IFF one wants to do this
> in kernel (and that's a big if), it seems much more preferable to provide
> a config/data-based approach rather than a programmable one.  If you want
> full freedom DPDK is architecturally just too powerful to compete with.

I must have missed the legal disclaimer that is usually put in front
of the DPDK marketing show :-)

I don't want full freedom. I want programmability with stack integration
at sufficient speed and the ability to benefit from the hardware
abstractions that the kernel provides.

> Proponents of XDP sometimes provide usage examples.
> Lets look at some of these.

[ I won't comment on any of the other use cases because they are of no
  interest to me ]

> * Load balancer
> State holding algorithm need sorting and searching, so also no fit for
> eBPF (could be exposed by function exports, but then can we do DoS by
> finding worst case scenarios?).
> 
> Also again needs way to forward frame out via another interface.
> 
> For cases where packet gets sent out via same interface it would appear
> to be easier to use port mirroring in a switch and use stochastic filtering
> on end nodes to determine which host should take responsibility.
> 
> XDP plus: central authority over how distribution will work in case
> nodes are added/removed from pool.
> But then again, it will be easier to handle this with netmap/dpdk where
> more complicated scheduling algorithms can be used.

I agree with you if the LB is a software based appliance in either a
dedicated VM or on dedicated baremetal.

The reality is turning out to be different in many cases though, LB
needs to be performed not only for north south but east west as well.
So even if I would handle LB for traffic entering my datacenter in user
space, I will need the same LB for packets from my applications and
I definitely don't want to move all of that into user space.

> * early drop/filtering.
> While its possible to do "u32" like filters with ebpf, all modern nics
> support ntuple filtering in hardware, which is going to be faster because
> such packet will never even be signalled to the operating system.
> For more complicated cases (e.g. doing socket lookup to check if particular
> packet does match bound socket (and expected sequence numbers etc) I don't
> see easy ways to do that with XDP (and without sk_buff context).
> Providing it via function exports is possible of course, but that will only
> result in an "arms race" where we will see special-sauce functions
> all over the place -- DoS will always attempt to go for something
> that is difficult to filter against, cf. all the recent volume-based
> floodings.

You probably put this last because this was the most difficult to
shoot down ;-)

The benefits of XDP for this use case are extremely obvious in combination
with local applications which need to be protected. ntuple filters won't
cut it. They are limited and subject to a certain rate at which they
can be configured. Any serious mitigation will require stateful filtering
with at least minimal L7 matching abilities and this is exactly where XDP
will excel.


[PATCH net-next v4 0/4] bpf: BPF for lightweight tunnel encapsulation

2016-11-30 Thread Thomas Graf
This series implements BPF program invocation from dst entries via the
lightweight tunnels infrastructure. The BPF program can be attached to
lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and see an L3
skb as context. Programs attached to input and output are read-only.
Programs attached to lwtunnel_xmit() can modify packet content, push
headers and redirect packets.

The facility can be used to:
 - Collect statistics and generate sampling data for a subset of traffic
   based on the dst utilized by the packet, thus allowing the existing
   realms to be extended.
 - Apply additional per route/dst filters to prohibit certain outgoing
   or incoming packets based on BPF filters. In particular, this allows
   maintaining per-dst custom state across multiple packets in BPF maps
   and applying filters based on statistics and behaviour observed over time.
 - Attachment of L2 headers at transmit where resolving the L2 address
   is not required.
 - Possibly many more.

v3 -> v4:
 - Bumped LWT_BPF_MAX_HEADROOM from 128 to 256 (Alexei)
 - Renamed bpf_skb_push() helper to bpf_skb_change_head() to relate to
   existing bpf_skb_change_tail() helper (Alexei/Daniel)
 - Added check in __bpf_redirect_common() to verify that the program
   added a link header before redirecting to an l2 device. Adding the
   check to the lwt-bpf code was considered but dropped due to the
   amount of code required to retrieve the net_device via the per-cpu
   redirect buffer. A test case was added to cover the scenario where a
   program redirects to an l2 device without adding an appropriate l2
   header. (Alexei)
 - Prohibited access to tc_classid (Daniel)
 - Collapsed bpf_verifier_ops instance for lwt in/out as they are
   identical (Daniel)
 - Some cosmetic changes

v2 -> v3:
 - Added real world sample lwt_len_hist_kern.c which demonstrates how to
   collect a histogram on packet sizes for all packets flowing through
   a number of routes.
 - Restricted output to be read-only. Since the header can no longer
   be modified, the rerouting functionality has been removed again.
 - Added test case which cover destructive modification of packet data.

v1 -> v2:
 - Added new BPF_LWT_REROUTE return code for program to indicate
   that new route lookup should be performed. Suggested by Tom.
 - New sample to illustrate rerouting
 - New patch 05: Recursion limit for lwtunnel_output for the case
   when the user creates a circular dst redirection. Also resolves the
   issue for ILA.
 - Fix to ensure headroom for potential future L2 header is still
   guaranteed

Thomas Graf (4):
  route: Set orig_output when redirecting to lwt on locally generated
traffic
  route: Set lwtstate for local traffic and cached input dsts
  bpf: BPF for lightweight tunnel infrastructure
  bpf: Add tests and samples for LWT-BPF

 include/linux/filter.h  |   2 +-
 include/uapi/linux/bpf.h|  32 +++-
 include/uapi/linux/lwtunnel.h   |  23 +++
 kernel/bpf/verifier.c   |  14 +-
 net/Kconfig |   8 +
 net/core/Makefile   |   1 +
 net/core/filter.c   | 173 +
 net/core/lwt_bpf.c  | 396 +++
 net/core/lwtunnel.c |   2 +
 net/ipv4/route.c|  37 ++--
 samples/bpf/Makefile|   4 +
 samples/bpf/bpf_helpers.h   |   4 +
 samples/bpf/lwt_len_hist.sh |  37 
 samples/bpf/lwt_len_hist_kern.c |  82 +
 samples/bpf/lwt_len_hist_user.c |  76 
 samples/bpf/test_lwt_bpf.c  | 253 +
 samples/bpf/test_lwt_bpf.sh | 399 
 17 files changed, 1527 insertions(+), 16 deletions(-)
 create mode 100644 net/core/lwt_bpf.c
 create mode 100755 samples/bpf/lwt_len_hist.sh
 create mode 100644 samples/bpf/lwt_len_hist_kern.c
 create mode 100644 samples/bpf/lwt_len_hist_user.c
 create mode 100644 samples/bpf/test_lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

-- 
2.7.4



[PATCH net-next v4 2/4] route: Set lwtstate for local traffic and cached input dsts

2016-11-30 Thread Thomas Graf
A route on the output path hitting a RTN_LOCAL route will keep the dst
associated on its way through the loopback device. On the receive path,
the dst_input() call will thus invoke the input handler of the route
created in the output path. Thus, lwt redirection for input must be done
for dsts allocated in the output path as well.

Also, if a route is cached in the input path, the allocated dst should
respect lwtunnel configuration on the nexthop as well.

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 net/ipv4/route.c | 39 ++-
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0c39b62..f4c4d7a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1602,6 +1602,19 @@ static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr)
spin_unlock_bh(&fnhe_lock);
 }
 
+static void set_lwt_redirect(struct rtable *rth)
+{
+   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+   rth->dst.lwtstate->orig_output = rth->dst.output;
+   rth->dst.output = lwtunnel_output;
+   }
+
+   if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+   rth->dst.lwtstate->orig_input = rth->dst.input;
+   rth->dst.input = lwtunnel_input;
+   }
+}
+
 /* called in rcu_read_lock() section */
 static int __mkroute_input(struct sk_buff *skb,
   const struct fib_result *res,
@@ -1691,14 +1704,7 @@ static int __mkroute_input(struct sk_buff *skb,
rth->dst.input = ip_forward;
 
rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
-   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
-   rth->dst.lwtstate->orig_output = rth->dst.output;
-   rth->dst.output = lwtunnel_output;
-   }
-   if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
-   rth->dst.lwtstate->orig_input = rth->dst.input;
-   rth->dst.input = lwtunnel_input;
-   }
+   set_lwt_redirect(rth);
skb_dst_set(skb, &rth->dst);
 out:
err = 0;
@@ -1925,8 +1931,18 @@ out: return err;
rth->dst.error= -err;
rth->rt_flags   &= ~RTCF_LOCAL;
}
+
if (do_cache) {
-   if (unlikely(!rt_cache_route(&FIB_RES_NH(*res), rth))) {
+   struct fib_nh *nh = &FIB_RES_NH(*res);
+
+   rth->dst.lwtstate = lwtstate_get(nh->nh_lwtstate);
+   if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+   WARN_ON(rth->dst.input == lwtunnel_input);
+   rth->dst.lwtstate->orig_input = rth->dst.input;
+   rth->dst.input = lwtunnel_input;
+   }
+
+   if (unlikely(!rt_cache_route(nh, rth))) {
rth->dst.flags |= DST_NOCACHE;
rt_add_uncached_list(rth);
}
@@ -2154,10 +2170,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
}
 
rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
-   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
-   rth->dst.lwtstate->orig_output = rth->dst.output;
-   rth->dst.output = lwtunnel_output;
-   }
+   set_lwt_redirect(rth);
 
return rth;
 }
-- 
2.7.4



[PATCH net-next v4 3/4] bpf: BPF for lightweight tunnel infrastructure

2016-11-30 Thread Thomas Graf
Registers new BPF program types which correspond to the LWT hooks:
  - BPF_PROG_TYPE_LWT_IN   => dst_input()
  - BPF_PROG_TYPE_LWT_OUT  => dst_output()
  - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()

The separate program types are required to differentiate between the
capabilities each LWT hook allows:

 * Programs attached to dst_input() or dst_output() are restricted and
   may only read the data of an skb. This prevents modification and
   possible invalidation of already validated packet headers on receive
   and the construction of illegal headers while the IP headers are
   still being assembled.

 * Programs attached to lwtunnel_xmit() are allowed to modify packet
   content as well as prepending an L2 header via a newly introduced
   helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
   invoked after the IP header has been assembled completely.

All BPF programs receive an skb with L3 headers attached and may return
one of the following error codes:

 BPF_OK - Continue routing as per nexthop
 BPF_DROP - Drop skb and return EPERM
 BPF_REDIRECT - Redirect skb to device as per redirect() helper.
(Only valid in lwtunnel_xmit() context)

The return codes are binary compatible with their TC_ACT_* relatives
to ease compatibility.

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 include/linux/filter.h|   2 +-
 include/uapi/linux/bpf.h  |  32 +++-
 include/uapi/linux/lwtunnel.h |  23 +++
 kernel/bpf/verifier.c |  14 +-
 net/Kconfig   |   8 +
 net/core/Makefile |   1 +
 net/core/filter.c | 173 ++
 net/core/lwt_bpf.c| 396 ++
 net/core/lwtunnel.c   |   2 +
 9 files changed, 646 insertions(+), 5 deletions(-)
 create mode 100644 net/core/lwt_bpf.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7f246a2..7ba6446 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -438,7 +438,7 @@ struct xdp_buff {
 };
 
 /* compute the linear packet data range [data, data_end) which
- * will be accessed by cls_bpf and act_bpf programs
+ * will be accessed by cls_bpf, act_bpf and lwt programs
  */
 static inline void bpf_compute_data_end(struct sk_buff *skb)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1370a9d..22ac827 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -101,6 +101,9 @@ enum bpf_prog_type {
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
BPF_PROG_TYPE_CGROUP_SKB,
+   BPF_PROG_TYPE_LWT_IN,
+   BPF_PROG_TYPE_LWT_OUT,
+   BPF_PROG_TYPE_LWT_XMIT,
 };
 
 enum bpf_attach_type {
@@ -409,6 +412,16 @@ union bpf_attr {
  *
  * int bpf_get_numa_node_id()
  * Return: Id of current NUMA node.
+ *
+ * int bpf_skb_change_head()
+ * Grows headroom of skb and adjusts MAC header offset accordingly.
+ * Will extend/reallocate as required automatically.
+ * May change skb data pointer and will thus invalidate any check
+ * performed for direct packet access.
+ * @skb: pointer to skb
+ * @len: length of header to be pushed in front
+ * @flags: Flags (unused for now)
+ * Return: 0 on success or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -453,7 +466,8 @@ union bpf_attr {
FN(skb_pull_data),  \
FN(csum_update),\
FN(set_hash_invalid),   \
-   FN(get_numa_node_id),
+   FN(get_numa_node_id),   \
+   FN(skb_change_head),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -537,6 +551,22 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
 };
 
+/* Generic BPF return codes which all BPF program types may support.
+ * The values are binary compatible with their TC_ACT_* counter-part to
+ * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
+ * programs.
+ *
+ * XDP is handled separately, see XDP_*.
+ */
+enum bpf_ret_code {
+   BPF_OK = 0,
+   /* 1 reserved */
+   BPF_DROP = 2,
+   /* 3-6 reserved */
+   BPF_REDIRECT = 7,
+   /* >127 are reserved for prog type specific return codes */
+};
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index 453cc62..92724cba 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -10,6 +10,7 @@ enum lwtunnel_encap_types {
LWTUNNEL_ENCAP_ILA,
LWTUNNEL_ENCAP_IP6,
LWTUNNEL_ENCAP_SEG6,
+   LWTUNNEL_ENCAP_BPF,
__LWTUNNEL_ENCAP_MAX,
 };
 
@@ -43,4 +44,26 @@ enum lwtunnel_ip6_t {
 
 #define LWTUNNEL_IP6_MAX (__LWTUNNEL_IP

[PATCH net-next v4 4/4] bpf: Add tests and samples for LWT-BPF

2016-11-30 Thread Thomas Graf
Adds a series of tests to verify the functionality of attaching
BPF programs at LWT hooks.

Also adds a sample which collects a histogram of packet sizes which
pass through an LWT hook.

$ ./lwt_len_hist.sh
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.253.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    39857.69
   1 -> 1: 0|  |
   2 -> 3: 0|  |
   4 -> 7: 0|  |
   8 -> 15   : 0|  |
  16 -> 31   : 0|  |
  32 -> 63   : 22   |  |
  64 -> 127  : 98   |  |
 128 -> 255  : 213  |  |
 256 -> 511  : 1444251  |  |
 512 -> 1023 : 660610   |***   |
1024 -> 2047 : 535241   |**|
2048 -> 4095 : 19   |  |
4096 -> 8191 : 180  |  |
8192 -> 16383: 5578023  |* |
   16384 -> 32767: 632099   |***   |
   32768 -> 65535: 6575 |      |

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 samples/bpf/Makefile|   4 +
 samples/bpf/bpf_helpers.h   |   4 +
 samples/bpf/lwt_len_hist.sh |  37 
 samples/bpf/lwt_len_hist_kern.c |  82 +
 samples/bpf/lwt_len_hist_user.c |  76 
 samples/bpf/test_lwt_bpf.c  | 253 +
 samples/bpf/test_lwt_bpf.sh | 399 
 7 files changed, 855 insertions(+)
 create mode 100755 samples/bpf/lwt_len_hist.sh
 create mode 100644 samples/bpf/lwt_len_hist_kern.c
 create mode 100644 samples/bpf/lwt_len_hist_user.c
 create mode 100644 samples/bpf/test_lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 22b6407e..3161f82 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -29,6 +29,7 @@ hostprogs-y += test_current_task_under_cgroup
 hostprogs-y += trace_event
 hostprogs-y += sampleip
 hostprogs-y += tc_l2_redirect
+hostprogs-y += lwt_len_hist
 
 test_lru_dist-objs := test_lru_dist.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
@@ -59,6 +60,7 @@ test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \
 trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
 sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
 tc_l2_redirect-objs := bpf_load.o libbpf.o tc_l2_redirect_user.o
+lwt_len_hist-objs := bpf_load.o libbpf.o lwt_len_hist_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -89,6 +91,7 @@ always += xdp2_kern.o
 always += test_current_task_under_cgroup_kern.o
 always += trace_event_kern.o
 always += sampleip_kern.o
+always += lwt_len_hist_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(objtree)/tools/testing/selftests/bpf/
@@ -117,6 +120,7 @@ HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
 HOSTLOADLIBES_trace_event += -lelf
 HOSTLOADLIBES_sampleip += -lelf
 HOSTLOADLIBES_tc_l2_redirect += -l elf
+HOSTLOADLIBES_lwt_len_hist += -l elf
 
# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
#  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 90f44bd..a246c61 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -80,6 +80,8 @@ struct bpf_map_def {
unsigned int map_flags;
 };
 
+static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
+   (void *) BPF_FUNC_skb_load_bytes;
static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from, int len, int flags) =
	(void *) BPF_FUNC_skb_store_bytes;
static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int flags) =
@@ -88,6 +90,8 @@ static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flag
(void *) BPF_FUNC_l4_csum_replace;
 static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
(void *) BPF_FUNC_skb_under_cgroup;
+static int (*bpf_skb_change_head)(void *, int len, int flags) =
+   (void *) BPF_FUNC_skb_change_head;
 
 #if defined(__x86_64__)
 
diff --git a/samples/bpf/lwt_len_hist.sh b/

[PATCH net-next v4 1/4] route: Set orig_output when redirecting to lwt on locally generated traffic

2016-11-30 Thread Thomas Graf
orig_output for IPv4 was only set for dsts which hit an input route.
Set it consistently for locally generated traffic as well to allow
lwt to continue the dst_output() path as configured by the nexthop.

Fixes: 2536862311d ("lwt: Add support to redirect dst.input")
Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 net/ipv4/route.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index d37fc6f..0c39b62 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2154,8 +2154,10 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
}
 
rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
-   if (lwtunnel_output_redirect(rth->dst.lwtstate))
+   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+   rth->dst.lwtstate->orig_output = rth->dst.output;
rth->dst.output = lwtunnel_output;
+   }
 
return rth;
 }
-- 
2.7.4



Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure

2016-11-30 Thread Thomas Graf
On 11/29/16 at 11:01pm, Alexei Starovoitov wrote:
> On Wed, Nov 30, 2016 at 07:48:51AM +0100, Thomas Graf wrote:
> > Should we check in __bpf_redirect_common() whether mac_header <
> > network_header then or add it to lwt-bpf conditional on
> > dev_is_mac_header_xmit()?
> 
> may be only extra 'if' in lwt-bpf is all we need?

Agreed, I will add a mac_header < network_header check to lwt-bpf if we
redirect to an l2 device.

> I'm still missing what will happen if we 'forget' to do
> bpf_skb_push() inside the lwt-bpf program, but still do redirect
> in lwt_xmit stage to l2 netdev...

The same as for an AF_PACKET socket not providing an actual L2 header.
I will add a test case to cover this scenario as well.


Re: [PATCH net-next v3 4/4] bpf: Add tests and samples for LWT-BPF

2016-11-29 Thread Thomas Graf
On 11/29/16 at 04:17pm, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 02:21:23PM +0100, Thomas Graf wrote:
> > Adds a series of tests to verify the functionality of attaching
> > BPF programs at LWT hooks.
> > 
> > Also adds a sample which collects a histogram of packet sizes which
> > pass through an LWT hook.
> > 
> > $ ./lwt_len_hist.sh
> > Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.253.2 () port 0 AF_INET : demo
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> >
> >  87380  16384  16384    10.00    39857.69
> 
> Nice!
> 
> > +   ret = bpf_redirect(ifindex, 0);
> > +   if (ret < 0) {
> > +   printk("bpf_redirect() failed: %d\n", ret);
> > +   return BPF_DROP;
> > +   }
> 
> this 'if' looks a bit weird. You're passing 0 as flags,
> so this helper will always succeed.
> Other sample code often does 'return bpf_redirect(...)'
> due to this reasoning.

Right, the if branch is absolutely useless. I will remove it.


Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure

2016-11-29 Thread Thomas Graf
On 11/29/16 at 04:15pm, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
> ...
> > +#define LWT_BPF_MAX_HEADROOM 128
> 
> why 128?
> btw I'm thinking for XDP to use 256, so metadata can be stored in there.

It's an arbitrary limit to catch obvious misconfiguration. I'm absolutely
fine with bumping it to 256.

> > +static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> > +  struct dst_entry *dst, bool can_redirect)
> > +{
> > +   int ret;
> > +
> > +   /* Preempt disable is needed to protect per-cpu redirect_info between
> > +* BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> > +* access to maps strictly require a rcu_read_lock() for protection,
> > +* mixing with BH RCU lock doesn't work.
> > +*/
> > +   preempt_disable();
> > +   rcu_read_lock();
> > +   bpf_compute_data_end(skb);
> > +   ret = BPF_PROG_RUN(lwt->prog, skb);
> > +   rcu_read_unlock();
> > +
> > +   switch (ret) {
> > +   case BPF_OK:
> > +   break;
> > +
> > +   case BPF_REDIRECT:
> > +   if (!can_redirect) {
> > +			WARN_ONCE(1, "Illegal redirect return code in prog %s\n",
> > +				  lwt->name ? : "");
> > +   ret = BPF_OK;
> > +   } else {
> > +   ret = skb_do_redirect(skb);
> 
> I think this assumes that program did bpf_skb_push and L2 header is present.
> Would it make sense to check that mac_header < network_header here to make
> sure that it actually happened? I think the cost of single 'if' isn't much.
> Also skb_do_redirect() can redirect to l3 tunnels like ipip ;)
> so program shouldn't be doing bpf_skb_push in such case...

We are currently guaranteeing mac_header <= network_header given that
bpf_skb_push() is calling skb_reset_mac_header() unconditionally.

Even if a program were to push an L2 header and then redirect to an l3
tunnel, __bpf_redirect_no_mac will pull it off again and correct the
mac_header offset.

Should we check in __bpf_redirect_common() whether mac_header <
network_header then or add it to lwt-bpf conditional on
dev_is_mac_header_xmit()?

> May be rename bpf_skb_push to bpf_skb_push_l2 ?
> since it's doing skb_reset_mac_header(skb); at the end of it?
> Or it's probably better to use 'flags' argument to tell whether
> bpf_skb_push() should set mac_header or not ?

I added the flags for this purpose but left them unused for now. This
will allow to add a flag later to extend the l3 header prior to
pushing an l2 header, hence the current helper name. But even then,
we would always reset the mac header as well to ensure we never
leave an skb with mac_header > network_header. Are you seeing a
scenario where we would want to omit the mac header reset?

> Then this bit:
> 
> > +   case BPF_OK:
> > +   /* If the L3 header was expanded, headroom might be too
> > +* small for L2 header now, expand as needed.
> > +*/
> > +   ret = xmit_check_hhlen(skb);
> 
> will work fine as well...
> which probably needs "mac_header wasn't set" check? or it's fine?

Right. The check is not strictly required right now but is a safety net
to ensure that the skb passed to dst_neigh_output() which will add the
l2 header in the non-redirect case to always have sufficient headroom.
It will currently be a NOP as we are not allowing to extend the l3 header
in bpf_skb_push() yet. I left this in there to ensure we do not forget
to add this later.


Re: [PATCH v3 net-next 0/4] bpf: BPF for lightweight tunnel encapsulation

2016-11-29 Thread Thomas Graf
Hi Hannes,

On 11/29/16 at 03:15pm, Hannes Frederic Sowa wrote:
> Did you look at the cgroup based hooks which were added recently in
> ip_finish_output for cgroup ebpf support and in general the cgroup bpf
> subsystem. Does some of this solve the problem for you already? Would be
> interesting to hear your opinion on that.

What I'm looking for is the ability to collect statistics and generate
samples for a subset of the traffic, e.g. all intra data center
traffic, all packets hitting the default route in a network namespace,
all packets which use a dst tying a certain endpoint to particular TCP
metrics. For the examples above, LWT provides a very intuitive and
natural way to do so while amortizing the cost of the route lookup
which is required anyway.

The cgroup hook provides similar semantics but applies when the
application context is of interest. Obviously, tasks in a cgroup may be sharing
routes so I can't use it as a replacement. However, using the two in
combination will become highly useful as it allows to gather statistics
individually for both application context and routing context and then
aggregate them to see how applications are using different network
segments.

Aside from the different context matching, the cgroup hook will not
allow to modify the packet as the lwtunnel_xmit() post
ip_finish_output does.


[PATCH net-next v3 1/4] route: Set orig_output when redirecting to lwt on locally generated traffic

2016-11-29 Thread Thomas Graf
orig_output for IPv4 was only set for dsts which hit an input route.
Set it consistently for locally generated traffic as well to allow
lwt to continue the dst_output() path as configured by the nexthop.

Fixes: 2536862311d ("lwt: Add support to redirect dst.input")
Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 net/ipv4/route.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index d37fc6f..0c39b62 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2154,8 +2154,10 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
}
 
rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
-   if (lwtunnel_output_redirect(rth->dst.lwtstate))
+   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+   rth->dst.lwtstate->orig_output = rth->dst.output;
rth->dst.output = lwtunnel_output;
+   }
 
return rth;
 }
-- 
2.7.4



[PATCH net-next v3 4/4] bpf: Add tests and samples for LWT-BPF

2016-11-29 Thread Thomas Graf
Adds a series of tests to verify the functionality of attaching
BPF programs at LWT hooks.

Also adds a sample which collects a histogram of packet sizes which
pass through an LWT hook.

$ ./lwt_len_hist.sh
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.253.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    39857.69
   1 -> 1: 0|  |
   2 -> 3: 0|  |
   4 -> 7: 0|  |
   8 -> 15   : 0|  |
  16 -> 31   : 0|  |
  32 -> 63   : 22   |  |
  64 -> 127  : 98   |  |
 128 -> 255  : 213  |  |
 256 -> 511  : 1444251  |  |
 512 -> 1023 : 660610   |***   |
1024 -> 2047 : 535241   |**|
2048 -> 4095 : 19   |  |
4096 -> 8191 : 180  |  |
8192 -> 16383: 5578023  |* |
   16384 -> 32767: 632099   |***   |
   32768 -> 65535: 6575 |      |

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 samples/bpf/Makefile|   4 +
 samples/bpf/bpf_helpers.h   |   4 +
 samples/bpf/lwt_len_hist.sh |  37 
 samples/bpf/lwt_len_hist_kern.c |  82 +
 samples/bpf/lwt_len_hist_user.c |  76 
 samples/bpf/test_lwt_bpf.c  | 247 ++
 samples/bpf/test_lwt_bpf.sh | 385 
 7 files changed, 835 insertions(+)
 create mode 100755 samples/bpf/lwt_len_hist.sh
 create mode 100644 samples/bpf/lwt_len_hist_kern.c
 create mode 100644 samples/bpf/lwt_len_hist_user.c
 create mode 100644 samples/bpf/test_lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 22b6407e..3161f82 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -29,6 +29,7 @@ hostprogs-y += test_current_task_under_cgroup
 hostprogs-y += trace_event
 hostprogs-y += sampleip
 hostprogs-y += tc_l2_redirect
+hostprogs-y += lwt_len_hist
 
 test_lru_dist-objs := test_lru_dist.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
@@ -59,6 +60,7 @@ test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \
 trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
 sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
 tc_l2_redirect-objs := bpf_load.o libbpf.o tc_l2_redirect_user.o
+lwt_len_hist-objs := bpf_load.o libbpf.o lwt_len_hist_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -89,6 +91,7 @@ always += xdp2_kern.o
 always += test_current_task_under_cgroup_kern.o
 always += trace_event_kern.o
 always += sampleip_kern.o
+always += lwt_len_hist_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(objtree)/tools/testing/selftests/bpf/
@@ -117,6 +120,7 @@ HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
 HOSTLOADLIBES_trace_event += -lelf
 HOSTLOADLIBES_sampleip += -lelf
 HOSTLOADLIBES_tc_l2_redirect += -l elf
+HOSTLOADLIBES_lwt_len_hist += -l elf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 90f44bd..f34e417 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -80,6 +80,8 @@ struct bpf_map_def {
unsigned int map_flags;
 };
 
+static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
+   (void *) BPF_FUNC_skb_load_bytes;
 static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from, int len, int flags) =
 	(void *) BPF_FUNC_skb_store_bytes;
 static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int flags) =
@@ -88,6 +90,8 @@ static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flag
 	(void *) BPF_FUNC_l4_csum_replace;
 static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
(void *) BPF_FUNC_skb_under_cgroup;
+static int (*bpf_skb_push)(void *, int len, int flags) =
+   (void *) BPF_FUNC_skb_push;
 
 #if defined(__x86_64__)
 
diff --git a/samples/bpf/lwt_len_hist.sh b/samples/bpf/lwt_l

[PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure

2016-11-29 Thread Thomas Graf
Registers new BPF program types which correspond to the LWT hooks:
  - BPF_PROG_TYPE_LWT_IN   => dst_input()
  - BPF_PROG_TYPE_LWT_OUT  => dst_output()
  - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()

The separate program types are required to differentiate between the
capabilities each LWT hook allows:

 * Programs attached to dst_input() or dst_output() are restricted and
   may only read the data of an skb. This prevents modification and
   possible invalidation of already validated packet headers on receive
   and the construction of illegal headers while the IP headers are
   still being assembled.

 * Programs attached to lwtunnel_xmit() are allowed to modify packet
   content as well as prepending an L2 header via a newly introduced
   helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
   after the IP header has been assembled completely.

All BPF programs receive an skb with L3 headers attached and may return
one of the following error codes:

 BPF_OK - Continue routing as per nexthop
 BPF_DROP - Drop skb and return EPERM
 BPF_REDIRECT - Redirect skb to device as per redirect() helper.
(Only valid in lwtunnel_xmit() context)

The return codes are binary compatible with their TC_ACT_* relatives
to ease compatibility.

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 include/linux/filter.h|   2 +-
 include/uapi/linux/bpf.h  |  32 +++-
 include/uapi/linux/lwtunnel.h |  23 +++
 kernel/bpf/verifier.c |  14 +-
 net/Kconfig   |   8 +
 net/core/Makefile |   1 +
 net/core/filter.c | 148 +++-
 net/core/lwt_bpf.c| 397 ++
 net/core/lwtunnel.c   |   2 +
 9 files changed, 621 insertions(+), 6 deletions(-)
 create mode 100644 net/core/lwt_bpf.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7f246a2..7ba6446 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -438,7 +438,7 @@ struct xdp_buff {
 };
 
 /* compute the linear packet data range [data, data_end) which
- * will be accessed by cls_bpf and act_bpf programs
+ * will be accessed by cls_bpf, act_bpf and lwt programs
  */
 static inline void bpf_compute_data_end(struct sk_buff *skb)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1370a9d..81c02c5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -101,6 +101,9 @@ enum bpf_prog_type {
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
BPF_PROG_TYPE_CGROUP_SKB,
+   BPF_PROG_TYPE_LWT_IN,
+   BPF_PROG_TYPE_LWT_OUT,
+   BPF_PROG_TYPE_LWT_XMIT,
 };
 
 enum bpf_attach_type {
@@ -409,6 +412,16 @@ union bpf_attr {
  *
  * int bpf_get_numa_node_id()
  * Return: Id of current NUMA node.
+ *
+ * int bpf_skb_push()
+ * Adds room to the beginning of the skb and adjusts the MAC header offset
+ * accordingly. Extends/reallocates the needed skb headroom automatically.
+ * May change skb data pointer and will thus invalidate any check done
+ * for direct packet access.
+ * @skb: pointer to skb
+ * @len: length of header to be pushed in front
+ * @flags: Flags (unused for now)
+ * Return: 0 on success or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -453,7 +466,8 @@ union bpf_attr {
FN(skb_pull_data),  \
FN(csum_update),\
FN(set_hash_invalid),   \
-   FN(get_numa_node_id),
+   FN(get_numa_node_id),   \
+   FN(skb_push),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -537,6 +551,22 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
 };
 
+/* Generic BPF return codes which all BPF program types may support.
+ * The values are binary compatible with their TC_ACT_* counter-part to
+ * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
+ * programs.
+ *
+ * XDP is handled separately, see XDP_*.
+ */
+enum bpf_ret_code {
+   BPF_OK = 0,
+   /* 1 reserved */
+   BPF_DROP = 2,
+   /* 3-6 reserved */
+   BPF_REDIRECT = 7,
+   /* >127 are reserved for prog type specific return codes */
+};
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index 453cc62..937f660 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -10,6 +10,7 @@ enum lwtunnel_encap_types {
LWTUNNEL_ENCAP_ILA,
LWTUNNEL_ENCAP_IP6,
LWTUNNEL_ENCAP_SEG6,
+   LWTUNNEL_ENCAP_BPF,
__LWTUNNEL_ENCAP_MAX,
 };
 
@@ -43,4 +44,26 @@ enum lwtunnel_ip6_t {
 
 #define LWTUNNEL_IP6_MAX (__LWTUNNEL_IP6_MAX - 1)
 
+enum {
+   L

[PATCH v3 net-next 0/4] bpf: BPF for lightweight tunnel encapsulation

2016-11-29 Thread Thomas Graf
This series implements BPF program invocation from dst entries via the
lightweight tunnels infrastructure. The BPF program can be attached to
lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and see an L3
skb as context. Programs attached to input and output are read-only.
Programs attached to lwtunnel_xmit() can modify packet content, push
headers and redirect packets.

The facility can be used to:
 - Collect statistics and generate sampling data for a subset of traffic
   based on the dst utilized by the packet thus allowing to extend the
   existing realms.
 - Apply additional per route/dst filters to prohibit certain outgoing
   or incoming packets based on BPF filters. In particular, this allows
   to maintain per dst custom state across multiple packets in BPF maps
   and apply filters based on statistics and behaviour observed over time.
 - Attachment of L2 headers at transmit where resolving the L2 address
   is not required.
 - Possibly many more.

v2 -> v3:
 - Added real world sample lwt_len_hist_kern.c which demonstrates how to
   collect a histogram on packet sizes for all packets flowing through
   a number of routes.
 - Restricted output to be read-only. Since the header can no longer
   be modified, the rerouting functionality has been removed again.
 - Added test case which cover destructive modification of packet data.

v1 -> v2:
 - Added new BPF_LWT_REROUTE return code for program to indicate
   that new route lookup should be performed. Suggested by Tom.
 - New sample to illustrate rerouting
 - New patch 05: Recursion limit for lwtunnel_output for the case
   when user creates circular dst redirection. Also resolves the
   issue for ILA.
 - Fix to ensure headroom for potential future L2 header is still
   guaranteed


Thomas Graf (4):
  route: Set orig_output when redirecting to lwt on locally generated
traffic
  route: Set lwtstate for local traffic and cached input dsts
  bpf: BPF for lightweight tunnel infrastructure
  bpf: Add tests and samples for LWT-BPF

 include/linux/filter.h  |   2 +-
 include/uapi/linux/bpf.h|  32 +++-
 include/uapi/linux/lwtunnel.h   |  23 +++
 kernel/bpf/verifier.c   |  14 +-
 net/Kconfig |   8 +
 net/core/Makefile   |   1 +
 net/core/filter.c   | 148 ++-
 net/core/lwt_bpf.c  | 397 
 net/core/lwtunnel.c |   2 +
 net/ipv4/route.c|  37 ++--
 samples/bpf/Makefile|   4 +
 samples/bpf/bpf_helpers.h   |   4 +
 samples/bpf/lwt_len_hist.sh |  37 
 samples/bpf/lwt_len_hist_kern.c |  82 +
 samples/bpf/lwt_len_hist_user.c |  76 
 samples/bpf/test_lwt_bpf.c  | 247 +
 samples/bpf/test_lwt_bpf.sh | 385 ++
 17 files changed, 1482 insertions(+), 17 deletions(-)
 create mode 100644 net/core/lwt_bpf.c
 create mode 100755 samples/bpf/lwt_len_hist.sh
 create mode 100644 samples/bpf/lwt_len_hist_kern.c
 create mode 100644 samples/bpf/lwt_len_hist_user.c
 create mode 100644 samples/bpf/test_lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

-- 
2.7.4



[PATCH net-next v3 2/4] route: Set lwtstate for local traffic and cached input dsts

2016-11-29 Thread Thomas Graf
A route on the output path hitting a RTN_LOCAL route will keep the dst
associated on its way through the loopback device. On the receive path,
the dst_input() call will thus invoke the input handler of the route
created in the output path. Thus, lwt redirection for input must be done
for dsts allocated in the output path as well.

Also, if a route is cached in the input path, the allocated dst should
respect lwtunnel configuration on the nexthop as well.

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 net/ipv4/route.c | 39 ++-
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0c39b62..f4c4d7a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1602,6 +1602,19 @@ static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr)
	spin_unlock_bh(&fnhe_lock);
 }
 
+static void set_lwt_redirect(struct rtable *rth)
+{
+   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+   rth->dst.lwtstate->orig_output = rth->dst.output;
+   rth->dst.output = lwtunnel_output;
+   }
+
+   if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+   rth->dst.lwtstate->orig_input = rth->dst.input;
+   rth->dst.input = lwtunnel_input;
+   }
+}
+
 /* called in rcu_read_lock() section */
 static int __mkroute_input(struct sk_buff *skb,
   const struct fib_result *res,
@@ -1691,14 +1704,7 @@ static int __mkroute_input(struct sk_buff *skb,
rth->dst.input = ip_forward;
 
rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
-   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
-   rth->dst.lwtstate->orig_output = rth->dst.output;
-   rth->dst.output = lwtunnel_output;
-   }
-   if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
-   rth->dst.lwtstate->orig_input = rth->dst.input;
-   rth->dst.input = lwtunnel_input;
-   }
+   set_lwt_redirect(rth);
	skb_dst_set(skb, &rth->dst);
 out:
err = 0;
@@ -1925,8 +1931,18 @@ out: return err;
rth->dst.error= -err;
rth->rt_flags   &= ~RTCF_LOCAL;
}
+
if (do_cache) {
-	if (unlikely(!rt_cache_route(&FIB_RES_NH(res), rth))) {
+		struct fib_nh *nh = &FIB_RES_NH(res);
+
+   rth->dst.lwtstate = lwtstate_get(nh->nh_lwtstate);
+   if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+   WARN_ON(rth->dst.input == lwtunnel_input);
+   rth->dst.lwtstate->orig_input = rth->dst.input;
+   rth->dst.input = lwtunnel_input;
+   }
+
+   if (unlikely(!rt_cache_route(nh, rth))) {
rth->dst.flags |= DST_NOCACHE;
rt_add_uncached_list(rth);
}
@@ -2154,10 +2170,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
}
 
rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
-   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
-   rth->dst.lwtstate->orig_output = rth->dst.output;
-   rth->dst.output = lwtunnel_output;
-   }
+   set_lwt_redirect(rth);
 
return rth;
 }
-- 
2.7.4



Re: [PATCH net-next v4 3/9] ipv6: sr: add support for SRH encapsulation and injection with lwtunnels

2016-11-04 Thread Thomas Graf
On 11/04/16 at 11:29am, David Lebrun wrote:
> +/* insert an SRH within an IPv6 packet, just after the IPv6 header */
> +static int seg6_do_srh_inline(struct sk_buff *skb, struct ipv6_sr_hdr *osrh)
> +{
> + struct ipv6hdr *hdr, *oldhdr;
> + struct ipv6_sr_hdr *isrh;
> + int hdrlen, err;
> +
> + hdrlen = (osrh->hdrlen + 1) << 3;
> +
> + err = pskb_expand_head(skb, hdrlen, 0, GFP_ATOMIC);
> + if (unlikely(err))
> + return err;
> +
> + oldhdr = ipv6_hdr(skb);
> +
> + skb_pull(skb, sizeof(struct ipv6hdr));
> + skb_postpull_rcsum(skb, skb_network_header(skb),
> +sizeof(struct ipv6hdr));
> +
> + skb_push(skb, sizeof(struct ipv6hdr) + hdrlen);
> + skb_reset_network_header(skb);
> + skb_mac_header_rebuild(skb);
> +
> + hdr = ipv6_hdr(skb);
> +
> + memmove(hdr, oldhdr, sizeof(*hdr));
> +
> + isrh = (void *)hdr + sizeof(*hdr);
> + memcpy(isrh, osrh, hdrlen);
> +
> + isrh->nexthdr = hdr->nexthdr;
> + hdr->nexthdr = NEXTHDR_ROUTING;
> +
> + isrh->segments[0] = hdr->daddr;
> + hdr->daddr = isrh->segments[isrh->first_segment];

Where do you verify that isrh->first_segment is not out of bounds?

> + skb_postpush_rcsum(skb, hdr, sizeof(struct ipv6hdr) + hdrlen);
> +
> + return 0;
> +}
> +
> +
> +static int seg6_build_state(struct net_device *dev, struct nlattr *nla,
> + unsigned int family, const void *cfg,
> + struct lwtunnel_state **ts)
> +{
> + struct nlattr *tb[SEG6_IPTUNNEL_MAX + 1];
> + struct seg6_iptunnel_encap *tuninfo;
> + struct lwtunnel_state *newts;
> + struct seg6_lwt *slwt;
> + int tuninfo_len;
> + int err;
> +
> + err = nla_parse_nested(tb, SEG6_IPTUNNEL_MAX, nla,
> +seg6_iptunnel_policy);
> +
> + if (err < 0)
> + return err;
> +
> + if (!tb[SEG6_IPTUNNEL_SRH])
> + return -EINVAL;
> +
> + tuninfo = nla_data(tb[SEG6_IPTUNNEL_SRH]);
> + tuninfo_len = SEG6_IPTUN_ENCAP_SIZE(tuninfo);

Nothing guarantees the size of the Netlink attribute right now. You
need to add a minimal size requirement to seg6_iptunnel_policy and
then check that the additional len provided in the struct itself does
not exceed the Netlink attribute length.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-03 Thread Thomas Graf
On 3 November 2016 at 08:52, Hannes Frederic Sowa
<han...@stressinduktion.org> wrote:
> On 02.11.2016 23:54, Thomas Graf wrote:
>> Why would I want to accept the overhead if I simply avoid it? Just
>> parsing the header and doing the hash lookup will add cost, cost for
>> each packet.
>
> That is true, but in case you are outside of the namespace, you still
> have to calculate the cost of doing the FIB lookup for the BPF program
> each time, too.
>
> E.g. given the lookup cost in a hash for a network namespace pointer
> vs. the cost of doing a FIB lookup to get a program that does a specific
> transformation sounds at least in the big O-notation to be in a better
> place. ;)
>
> If you have to do both anyway, probably your patchset will perform
> better, I agree.

Most containers are unprivileged, the route inside the container's
namespace is owned by the host and we can attach the BPF program
directly to the default route inside the container and all packets
egressing from the container will pass through it. That fib lookup is
needed anyway so we can leverage the cost of that lookup. We can drop
hostile packets early without ever going on L2 level.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-03 Thread Thomas Graf
On 2 November 2016 at 04:48, Hannes Frederic Sowa
 wrote:
> On Wed, Nov 2, 2016, at 00:07, Tom Herbert wrote:
>> On the other hand, I'm not really sure how to implement this level
>> of performance in LWT+BPF either. It seems like one way to do
>> that would be to create a program for each destination and set it on
>> each host. As you point out, that would create a million different programs which
>> doesn't seem manageable. I don't think the BPF map works either since
>> that implies we need a lookup (?). It seems like what we need is one
>> program but allow it to be parameterized with per destination
>> information saved in the route (LWT structure).
>
> Yes, that is my proposal. Just using the dst entry as meta-data (which
> can actually also be an ID for the network namespace the packet is
> coming from).

I have no objection to doing this on top of this series.

> My concern with using BPF is that the rest of the kernel doesn't really
> see the semantics and can't optimize or cache at specific points,
> because the kernel cannot introspect what the BPF program does (for
> metadata manipulation, one can e.g. specify that the program is "pure",
> and always provides the same output for some specified given input, thus
> things can be cached and memoized, but that framework seems very hard
> to build).

So you want to reintroduce a routing cache? Each packet needs to pass
through the BPF program anyway for accounting purposes. This is not
just about getting the packets out the right nexthop in the fastest
possible way.

> I also fear this becomes a kernel by-pass:
>
> It might be very hard e.g. to apply NFT/netfilter to such packets, if
> e.g. a redirect happens suddenly and packet flow is diverted from the
> one the user sees currently based on the interfaces and routing tables.

The LWT xmit hook is after the POST_ROUTING hook. The input and output
hook cannot redirect and output will become read-only just like input
already is. We are not bypassing anything. Please stop throwing the
word bypass around. This is just a false claim.

> That's why I am in favor of splitting this patchset down and allow the
> policies that should be expressed by BPF programs being applied to the
> specific subsystems (I am not totally against a generic BPF hook in
> input or output of the protocol engines). E.g. can we deal with static
> rewriting of L2 addresses in the neighbor cache? We already provide a
> fast header cache for L2 data which might be used here?

Split what? What policies?

I have two primary use cases for this:
1) Traffic into local containers: Containers are only supposed to do
L3, all L2 traffic is dropped for security reasons. The L2 header for
any packets in and out of the container is fixed and does not require
any sort of resolving. In order to feed packets from the local host
into the containers, a route with the container prefix is set up. It
points to a nexthop address which appears behind a veth pair. A BPF
program is listening at tc ingress on the veth pair and will enforce
policies and do accounting. It requires very ugly hacks because Linux
does not like to do forwarding to an address which is considered
local. It works but it is a hack.

What I want to do instead is to run the BPF program for the route
directly, apply the policies, do accounting, push the fixed dummy L2
header and redirect it to the container. If someone has netfilter
rules installed, they will still apply. Nothing is hidden.

2) For external traffic that is coming in. We have a BPF program
listening on tc ingress which matches on the destination address on
all incoming traffic. If the packet is a for a container, we perform
the same actions as above. In this case we are bypassing the routing
table. This is ugly. What I want to do instead is to have the
container prefix invoke the BPF program so all packets have a route
lookup performed and netfilter filtering performed, only after that,
the BPF program is invoked exclusively for the packets destined for
local containers. Yes, it would be possible to redirect into a
temporary veth again and listen on that but it again requires to fake
a L2 segment which is just unnecessary and slow.

This is not hiding anything and it is not bypassing anything.


Re: [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-02 Thread Thomas Graf
On 2 November 2016 at 07:39, Roopa Prabhu  wrote:
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index d6508c2..a675fd3 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -23,7 +23,7 @@ obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
>>  obj-$(CONFIG_NET_PTP_CLASSIFY) += ptp_classifier.o
>>  obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
>>  obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
>> -obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
>> +obj-$(CONFIG_LWTUNNEL) += lwtunnel.o lwt_bpf.o
>
> Any reason you want to keep lwt bpf under the main CONFIG_LWTUNNEL infra
> config?
> Since it is defined as yet another pluggable encap function, it seems like it
> will be better under a separate CONFIG_LWTUNNEL_BPF or CONFIG_LWT_BPF that
> depends on CONFIG_LWTUNNEL

The code was so minimal with no additional dependencies that I didn't
see a need for a separate Kconfig. I'm fine adding that in the next
iteration though. No objections.
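For reference, a sketch of the separate Kconfig symbol being discussed; the symbol name and help text are assumptions for illustration, not what was eventually merged:

```
config LWT_BPF
	bool "Execute BPF programs as route nexthop actions"
	depends on LWTUNNEL
	help
	  Allow attaching BPF programs to a route via the lightweight
	  tunnel infrastructure so that they are run on input, output
	  or xmit for packets using that route.
```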


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-02 Thread Thomas Graf
On 1 November 2016 at 17:07, Tom Herbert  wrote:
> On the other hand, I'm not really sure how to implement this level
> of performance in LWT+BPF either. It seems like one way to do
> that would be to create a program for each destination and set it on
> each host. As you point out, that would create a million different
> programs, which doesn't seem manageable. I don't think the BPF map
> works either since that implies we need a lookup (?). It seems like
> what we need is one program but allow it to be parameterized with per
> destination information saved in the route (LWT structure).

Attaching different BPF programs to millions of unique dsts doesn't
make any sense. That will obviously never scale and it's not
supposed to scale. This is meant to be used for prefixes which
represent a series of endpoints, f.e. all local containers, all
non-internal traffic, all vpn traffic, etc. I'm also not sure why we
are talking about ILA here, you have written a native implementation,
why would you want to solve it with BPF again?

If you want to run a single program for all dsts, feel free to run the
same BPF program for each dst. Nobody is forcing you to attach
individual programs.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-02 Thread Thomas Graf
On 1 November 2016 at 16:12, Hannes Frederic Sowa
<han...@stressinduktion.org> wrote:
> On 01.11.2016 21:59, Thomas Graf wrote:
>>> Dumping and verifying which routes get used might actually already be
>>> quite complex on its own. Thus my fear.
>>
>> We even have an API to query which route is used for a tuple. What
>> else would you like to see?
>
> I am not sure here. Some ideas I had were to allow tcpdump (pf_packet)
> sockets sniff at interfaces and also gather and dump the metadata to
> user space (this would depend on bpf programs only doing the
> modifications in metadata and not in the actual packet).

Not sure I understand. Why does this depend on BPF?

> Or maybe just tracing support (without depending on the eBPF program
> developer to have added debugging in the BPF program).

Absolutely in favour of that.

>> This will be addressed with signing AFAIK.
>
> This sounds a bit unrealistic. Signing lots of small programs can be a
> huge burden to the entity doing the signing (if it is not on the same
> computer). And as far as I understood the programs should be generated
> dynamically?

Right, for generated programs, a hash is a better fit and still sufficient.

>> Would it help if we allowed storing the original source used for
>> bytecode generation? That would make it clear which program was used.
>
> I would also be fine with just a strong hash of the bytecode, so the
> program can be identified accurately. Maybe helps with deduplication
> later on, too. ;)

OK, I think we all already agreed on doing this.

> Even though I read through the patchset I am not absolutely sure which
> problem it really solves. Especially because lots of things can be done
> already at the ingress vs. egress interface (I looked at patch 4 but I
> am not sure how realistic they are).

Filtering at egress requires attaching the BPF program to all
potential outgoing interfaces and then passing every single packet
through the program, whereas with LWT BPF, I'm only taking the cost
where actually needed.

>> I also don't see how this could possibly scale if all packets must go
>> through a single BPF program. The overhead will be tremendous if you
>> only want to filter a couple of prefixes.
>
> In case of hash table lookup it should be fast. llvm will probably also
> generate jump table for a few 100 ip addresses, no? Additionally the
> routing table lookup could be not done at all.

Why would I want to accept the overhead if I simply avoid it? Just
parsing the header and doing the hash lookup will add cost, cost for
each packet.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Thomas Graf
On 1 November 2016 at 13:33, Tom Herbert  wrote:
> You need to show that is integrity is maintained with these patches.
> Part of this can be done, but part of this needs to be established in
> testing.
>
> I've already given specifics for at least one potential source of
> issues in routing issues. I would like to know what happens if someone
> rewrites an IPv4 packet into an IPv6 packet or vice versa. AFAICT we
> would be sending an IPv6 packet using an IPv4 route, or an IPv4
> packet using an IPv6 route. What is supposed to happen in these circumstances? What
> actually happens?
>
> Similarly, someone overwrites the whole packet with 0xff. What happens
> when we try to send that?

OK, I will add these tests to the selftest in the next iteration.


Re: [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Thomas Graf
On 1 November 2016 at 13:11, David Ahern <d...@cumulusnetworks.com> wrote:
> On 10/31/16 6:37 PM, Thomas Graf wrote:
>>  - Perform simple encapsulation where offload is of no concern.
>>(The existing functionality to attach a tunnel key to the skb
>> and redirect to a tunnel net_device to allow for offload
>> continues to work obviously).
>
> have you tested the adding of headers with large packets hitting gso and 
> fragmentation? Wondering if mtu is used properly when headers are added and 
> the lwt->headroom is not accounted. See 
> 14972cbd34ff668c390cbd2e6497323484c9e812

Thanks for the pointer David, I will look into this.

The packet size is currently limited to this when adding l2 headers:
skb->dev->mtu + skb->dev->hard_header_len


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Thomas Graf
On 1 November 2016 at 13:08, Hannes Frederic Sowa
<han...@stressinduktion.org> wrote:
> On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
>> If I understand you correctly then a single BPF program would be
>> loaded which then applies to all dst_output() calls? This has a huge
>> drawback, instead of multiple small BPF programs which do exactly what
>> is required per dst, a large BPF program is needed which matches on
>> metadata. That's way slower and renders one of the biggest advantages
of BPF invalid, the ability to generate a small program tailored to
>> a particular use. See Cilium.
>
> I thought more of hooks in the actual output/input functions specific to
> the protocol type (unfortunately again) protected by jump labels? Those
> hook get part of the dst_entry mapped so they can act on them.

This has no advantage over installing a BPF program at tc egress and
enabling to store/access metadata per dst. The whole point is to
execute bpf for a specific route.

> Another idea would be to put the eBPF hooks into the fib rules
> infrastructure. But I fear this wouldn't get you the hooks you were
> looking for? There they would only end up in the runtime path if
> actually activated.

Use of fib rules kills performance so it's not an option. I'm not even
sure that would be any simpler.

> Dumping and verifying which routes get used might actually already be
> quite complex on its own. Thus my fear.

We even have an API to query which route is used for a tuple. What
else would you like to see?

>> If it's based on metadata then you need to know the program logic and
>> associate it with the metadata in the dst. It actually doesn't get
>> much easier than to debug one of the samples, they are completely
>> static once compiled and it's very simple to verify if they do what
>> they are supposed to do.
>
> At the same time you can have lots of those programs and you e.g. would
> also need to verify if they are acting on the same data structures or
> have the identical code.

This will be addressed with signing AFAIK.

> It all reminds me a bit on grepping in source code which makes heavy use
> of function pointers with very generic and short names.

Is this statement related to routing? I don't get the reference to
function pointers and generic short names.

>> If you like the single program approach, feel free to load the same
>> program for every dst. Perfectly acceptable but I don't see why we
>> should force everybody to use that model.
>
> I am concerned having 100ths of BPF programs, all specialized on a
> particular route, to debug. Looking at one code file and its associated
> tables seems still easier to me.

100 programs != 100 source files. A lot more realistic is a single or
a handful of programs which get compiled for a particular route with
certain pieces enabled/disabled.

> E.g. imagine we have input routes and output routes with different BPF
> programs. We somehow must make sure all nodes kind of behave accordingly
> to "sane" network semantics. If you end up with an input route doing bpf

As soon as we have signing, you can verify your programs in testing,
sign the programs and then quickly verify on all your nodes whether
you are running the correct programs.

Would it help if we allowed storing the original source used for
bytecode generation? That would make it clear which program was used.

> processing and the according output node, which e.g. might be needed to
> reflect ICMP packets, doesn't behave accordingly you at least have two
> programs to debug already instead of a switch- or if-condition in one
> single code location. I would like to "force" this kind of symmetry to
> developers using eBPF, thus I thought meta-data manipulation and
> verification inside the kernel would be a better attack at this problem,
> no?

Are you saying you want a single gigantic program for both input and output?

That's not possible. The BPF program has different limitations
depending on where it runs. On input, any write action on the packet
is not allowed, extending the header is only allowed on xmit, and so
on.

I also don't see how this could possibly scale if all packets must go
through a single BPF program. The overhead will be tremendous if you
only want to filter a couple of prefixes.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Thomas Graf
On 1 November 2016 at 12:27, Tom Herbert <t...@herbertland.com> wrote:
> On Tue, Nov 1, 2016 at 12:11 PM, Thomas Graf <tg...@suug.ch> wrote:
>> You can do the same with act_pedit or cls_bpf at dev_queue_xmit()
>> before or after you go into the encapsulation device. This is a tool
>> for root, if you install a drop all ssh rule then you won't be able to
>> log into the box anymore. If you modify the packet and render it
>> invalid, the packet will be dropped.
>> If you attach a VLAN header while VLAN offload is already set up, then
>> the hardware will add another VLAN header, this is what I would
>> expect. I truly hope that we don't have code which crashes if we
>> dev_queue_xmit() garbage into any device. That would be horrible.
>>
>> Do you have a specific example that could be a problem?
>
> I believe Alexander was dealing with problems where drivers were not
> properly handling IP extension headers. We regularly get reports about

There are many ways to cause IP extension headers to be inserted. How
is this specific to BPF or this series?

> driver checksum failures on edge conditions. Making a fully functional

Not sure what an "edge condition" is? Untrusted networks? How is
drivers crashing on receive related to this series?

> and efficient transmit data path is nontrivial, there are many
> assumptions being made some documented some not. When drivers crash we
> either fix the driver or the protocol stack that is doing something
> bad.

Tom, we have a dozen ways to modify packet content already and we have
multiple ways to take raw packet data from userspace and inject them
at dev_queue_xmit() level and some even above. What is different for
lwtunnel_xmit()?

This is entirely optional and nobody is forced to use any of this. If
you don't trust the BPF program then you'd better not run it. It's
about the same as trusting a random tc filter or iptables ruleset. The
important point is that integrity is maintained at all times. I would
love to address any specific concern.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Thomas Graf
On 1 November 2016 at 11:51, Tom Herbert <t...@herbertland.com> wrote:
> On Tue, Nov 1, 2016 at 11:20 AM, Thomas Graf <tg...@suug.ch> wrote:
>> The headers cannot be extended or reduced so the offsets always remain
>> correct. What can happen is that the header contains invalid data.
>>
> If we can't add/remove headers then doesn't that really limit the
> utility of these patches? My assumption was that BPF+LWT is needed to
> allow users to define and implement their own encapsulations, EH
> insertion, packet modification, etc.

I thought you were specifically referring to lwtunnel_output(). You
can extend headers at lwtunnel_xmit() but you can't for
lwtunnel_output() and lwtunnel_input(). The xmit hook is in
ip_finish_output2() where it is fine to do so. We have existing code
doing this there.

>> You keep saying this "middle in the stack" but the point is exactly
>> the same as a raw socket with IPPROTO_RAW and hdrincl, see
>> rawv6_sendmsg() and rawv6_send_hdrincl(). An IPv6 raw socket can feed
>> arbitrary garbage into dst_output(). IPv4 does some minimal sanity
>> checks.
>>
> What I mean is that an admin can create a BPF program that run on any
> user packets (for instance default route could be set). This would be
> in the path of TCP, UDP, and other protocols tightly integrated with
> the stack. Packets being routed may be encapsulated, VLAN, have
> checksum offload, GORed set etc. They also might be looped back in
> which case the settings in skbuff become receive parameters.

You can do the same with act_pedit or cls_bpf at dev_queue_xmit()
before or after you go into the encapsulation device. This is a tool
for root, if you install a drop all ssh rule then you won't be able to
log into the box anymore. If you modify the packet and render it
invalid, the packet will be dropped.
If you attach a VLAN header while VLAN offload is already set up, then
the hardware will add another VLAN header, this is what I would
expect. I truly hope that we don't have code which crashes if we
dev_queue_xmit() garbage into any device. That would be horrible.

Do you have a specific example that could be a problem?


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Thomas Graf
On 1 November 2016 at 03:54, Hannes Frederic Sowa
 wrote:
> I do fear the complexity and debugability introduced by this patch set
> quite a bit.

What is the complexity concern? This is pretty straightforward. I
agree on debugability. This is being worked on separately as Alexei
mentioned, to address this for all BPF integrations.

> I wonder if architecturally it would be more feasible to add a generic
> (bpf-)hook into dst_output(_sk) and allow arbitrary metadata to be
> added into the dsts.
>
> BPF could then be able to access parts of the metadata in the attached
> metadata dst entries and performing the matching this way?

If I understand you correctly then a single BPF program would be
loaded which then applies to all dst_output() calls? This has a huge
drawback, instead of multiple small BPF programs which do exactly what
is required per dst, a large BPF program is needed which matches on
metadata. That's way slower and renders one of the biggest advantages
of BPF invalid, the ability to generate a small program tailored to
a particular use. See Cilium.

> The reason why I would prefer an approach like this: regardless of the
> routing lookup we would process the skb with bpf or not. This gives a
> single point to debug, where instead in your approach we first must
> figure out the corresponding bpf program and then check for it specifically.

Not sure I see what kind of advantage this actually provides. You can
dump the routes and see which programs get invoked and which section.
If it's based on metadata then you need to know the program logic and
associate it with the metadata in the dst. It actually doesn't get
much easier than to debug one of the samples, they are completely
static once compiled and it's very simple to verify if they do what
they are supposed to do.

If you like the single program approach, feel free to load the same
program for every dst. Perfectly acceptable but I don't see why we
should force everybody to use that model.


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-01 Thread Thomas Graf
On 1 November 2016 at 09:17, Tom Herbert <t...@herbertland.com> wrote:
> On Mon, Oct 31, 2016 at 5:37 PM, Thomas Graf <tg...@suug.ch> wrote:
>> {Open question:
>>  Tom brought up the question on whether it is safe to modify the packet
>>  in arbitrary ways before dst_output(). This is the equivalent of a raw
>>  socket injecting illegal headers. This v2 currently assumes that
>>  dst_output() is ready to accept invalid header values. This needs to be
>>  verified and if not the case, then raw sockets or dst_output() handlers
>>  must be fixed as well. Another option is to mark lwtunnel_output() as
>>  read-only for now.}
>>
> The question might not be so much about illegal headers but whether
> fields in the skbuff related to the packet contents are kept correct.
> We have protocol, header offsets, offsets for inner protocols also,
> encapsulation settings, checksum status, checksum offset, checksum

The headers cannot be extended or reduced so the offsets always remain
correct. What can happen is that the header contains invalid data.

> complete value, vlan information. Any or all of which I believe could
> be turned into being incorrect if we allow the packet to be
> arbitrarily modified by BPF. This problem is different than raw
> sockets because LWT operates in the middle of the stack, the skbuff
> has already been set up which such things.

You keep saying this "middle in the stack" but the point is exactly
the same as a raw socket with IPPROTO_RAW and hdrincl, see
rawv6_sendmsg() and rawv6_send_hdrincl(). An IPv6 raw socket can feed
arbitrary garbage into dst_output(). IPv4 does some minimal sanity
checks.

If this is a concern I'm fine with making the dst_output path read-only for now.

>> This series implements BPF program invocation from dst entries via the
>> lightweight tunnels infrastructure. The BPF program can be attached to
>> lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
>> skb as context. input is read-only, output can write, xmit can write,
>> push headers, and redirect.
>>
>> Motivation for this work:
>>  - Restricting outgoing routes beyond what the route tuple supports
>>  - Per route accounting beyond realms
>>  - Fast attachment of L2 headers where header does not require resolving
>>L2 addresses
>>  - ILA-like use cases where L3 addresses are resolved and then routed
>>in an async manner
>>  - Fast encapsulation + redirect. For now limited to use cases where not
>>setting inner and outer offset/protocol is OK.
>>
> Is checksum offload supported? By default, at least for Linux, we
> offload the outer UDP checksum in VXLAN and the other UDP
> encapsulations for performance.

No. UDP encap is done by setting a tunnel key through a helper and
letting the encapsulation device handle this. I don't currently see a
point in replicating all of that logic.


Re: [PATCH net-next v2 5/5] lwtunnel: Limit number of recursions on output to 5

2016-11-01 Thread Thomas Graf
On 11/01/16 at 12:52pm, kbuild test robot wrote:
> Hi Thomas,
> 
> [auto build test ERROR on net-next/master]
> 
> url:
> https://github.com/0day-ci/linux/commits/Thomas-Graf/bpf-BPF-for-lightweight-tunnel-encapsulation/20161101-084038
> config: arm64-allmodconfig (attached as .config)
> compiler: aarch64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
> wget 
> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
>  -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=arm64 
> 
> All errors (new ones prefixed by >>):
> 
>net/built-in.o: In function `bpf_output':
> >> ncsi-manage.c:(.text+0x8e9f4): undefined reference to 
> >> `ip6_route_output_flags'

Needs
depends on IPV6_MULTIPLE_TABLES || IPV6=n 

Compile testing with IPV6=y and IPV6_MULTIPLE_TABLES=n would have been
great. I'll submit a v3 after some time has passed to review the new
rerouting bits.


[PATCH net-next v2 4/5] bpf: Add samples for LWT-BPF

2016-10-31 Thread Thomas Graf
This adds a set of samples demonstrating the use of lwt-bpf combined
with a shell script which allows running the samples in the form of
a basic selftest.

The samples include:
 - Allowing all packets
 - Dropping all packets
 - Printing context information
 - Access packet data
 - IPv4 daddr rewrite in dst_output()
 - L2 MAC header push + redirect in lwt xmit

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 samples/bpf/bpf_helpers.h   |   4 +
 samples/bpf/lwt_bpf.c   | 235 
 samples/bpf/test_lwt_bpf.sh | 370 
 3 files changed, 609 insertions(+)
 create mode 100644 samples/bpf/lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 90f44bd..f34e417 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -80,6 +80,8 @@ struct bpf_map_def {
unsigned int map_flags;
 };
 
+static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
+   (void *) BPF_FUNC_skb_load_bytes;
 static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from, int len, int flags) =
(void *) BPF_FUNC_skb_store_bytes;
 static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int 
flags) =
@@ -88,6 +90,8 @@ static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flag
(void *) BPF_FUNC_l4_csum_replace;
 static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
(void *) BPF_FUNC_skb_under_cgroup;
+static int (*bpf_skb_push)(void *, int len, int flags) =
+   (void *) BPF_FUNC_skb_push;
 
 #if defined(__x86_64__)
 
diff --git a/samples/bpf/lwt_bpf.c b/samples/bpf/lwt_bpf.c
new file mode 100644
index 000..fc86275
--- /dev/null
+++ b/samples/bpf/lwt_bpf.c
@@ -0,0 +1,235 @@
+/* Copyright (c) 2016 Thomas Graf <tg...@tgraf.ch>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+#include 
+
+# define printk(fmt, ...)				\
+	({						\
+		char ____fmt[] = fmt;			\
+		bpf_trace_printk(____fmt, sizeof(____fmt),	\
+				 ##__VA_ARGS__);	\
+	})
+
+#define CB_MAGIC 1234
+
+/* Let all packets pass */
+SEC("nop")
+int do_nop(struct __sk_buff *skb)
+{
+   return BPF_OK;
+}
+
+/* Print some context information per packet to tracing buffer.
+ */
+SEC("ctx_test")
+int do_ctx_test(struct __sk_buff *skb)
+{
+   skb->cb[0] = CB_MAGIC;
+   printk("len %d hash %d protocol %d\n", skb->len, skb->hash,
+  skb->protocol);
+   printk("cb %d ingress_ifindex %d ifindex %d\n", skb->cb[0],
+  skb->ingress_ifindex, skb->ifindex);
+
+   return BPF_OK;
+}
+
+/* Print content of skb->cb[] to tracing buffer */
+SEC("print_cb")
+int do_print_cb(struct __sk_buff *skb)
+{
+   printk("cb0: %x cb1: %x cb2: %x\n", skb->cb[0], skb->cb[1],
+  skb->cb[2]);
+   printk("cb3: %x cb4: %x\n", skb->cb[3], skb->cb[4]);
+
+   return BPF_OK;
+}
+
+/* Print source and destination IPv4 address to tracing buffer */
+SEC("data_test")
+int do_data_test(struct __sk_buff *skb)
+{
+   void *data = (void *)(long)skb->data;
+   void *data_end = (void *)(long)skb->data_end;
+   struct iphdr *iph = data;
+
+   if (data + sizeof(*iph) > data_end) {
+   printk("packet truncated\n");
+   return BPF_DROP;
+   }
+
+   printk("src: %x dst: %x\n", iph->saddr, iph->daddr);
+
+   return BPF_OK;
+}
+
+#define IP_CSUM_OFF offsetof(struct iphdr, check)
+#define IP_DST_OFF offsetof(struct iphdr, daddr)
+#define IP_SRC_OFF offsetof(struct iphdr, saddr)
+#define IP_PROTO_OFF offsetof(struct iphdr, protocol)
+#define TCP_CSUM_OFF offsetof(struct tcphdr, check)
+#define UDP_CSUM_OFF offsetof(struct udphdr, check)
+#define IS_PSEUDO 0x10
+
+static inline int rewrite(struct __sk_buff *skb, uint32_t old_ip,
+ uint32_t new_ip, int rw_daddr)
+{
+   int ret, off = 0, flags = IS_PSEUDO;
+   uint8_t proto;
+
+   ret = bpf_skb_load_bytes(skb, IP_PROTO_OFF, &proto, 1);
+ 

[PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-10-31 Thread Thomas Graf
{Open question:
 Tom brought up the question on whether it is safe to modify the packet
 in arbitrary ways before dst_output(). This is the equivalent of a raw
 socket injecting illegal headers. This v2 currently assumes that
 dst_output() is ready to accept invalid header values. This needs to be
 verified and if not the case, then raw sockets or dst_output() handlers
 must be fixed as well. Another option is to mark lwtunnel_output() as
 read-only for now.}

This series implements BPF program invocation from dst entries via the
lightweight tunnels infrastructure. The BPF program can be attached to
lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
skb as context. input is read-only, output can write, xmit can write,
push headers, and redirect.

Motivation for this work:
 - Restricting outgoing routes beyond what the route tuple supports
 - Per route accounting beyond realms
 - Fast attachment of L2 headers where header does not require resolving
   L2 addresses
 - ILA-like use cases where L3 addresses are resolved and then routed
   in an async manner
 - Fast encapsulation + redirect. For now limited to use cases where not
   setting inner and outer offset/protocol is OK.

A couple of samples on how to use it can be found in patch 04.

v1 -> v2:
 - Added new BPF_LWT_REROUTE return code for program to indicate
   that new route lookup should be performed. Suggested by Tom.
 - New sample to illustrate rerouting
 - New patch 05: Recursion limit for lwtunnel_output for the case
   when user creates circular dst redirection. Also resolves the
   issue for ILA.
 - Fix to ensure headroom for potential future L2 header is still
   guaranteed

Thomas Graf (5):
  route: Set orig_output when redirecting to lwt on locally generated
traffic
  route: Set lwtstate for local traffic and cached input dsts
  bpf: BPF for lightweight tunnel encapsulation
  bpf: Add samples for LWT-BPF
  lwtunnel: Limit number of recursions on output to 5

 include/linux/filter.h|   2 +-
 include/uapi/linux/bpf.h  |  37 +++-
 include/uapi/linux/lwtunnel.h |  21 ++
 kernel/bpf/verifier.c |  16 +-
 net/Kconfig   |   1 +
 net/core/Makefile |   2 +-
 net/core/filter.c | 148 -
 net/core/lwt_bpf.c| 504 ++
 net/core/lwtunnel.c   |  15 +-
 net/ipv4/route.c  |  37 +++-
 samples/bpf/bpf_helpers.h |   4 +
 samples/bpf/lwt_bpf.c | 235 
 samples/bpf/test_lwt_bpf.sh   | 370 +++
 13 files changed, 1373 insertions(+), 19 deletions(-)
 create mode 100644 net/core/lwt_bpf.c
 create mode 100644 samples/bpf/lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

-- 
2.7.4



[PATCH net-next v2 2/5] route: Set lwtstate for local traffic and cached input dsts

2016-10-31 Thread Thomas Graf
A route on the output path hitting a RTN_LOCAL route will keep the dst
associated on its way through the loopback device. On the receive path,
the dst_input() call will thus invoke the input handler of the route
created in the output path. Thus, lwt redirection for input must be done
for dsts allocated in the output path as well.

Also, if a route is cached in the input path, the allocated dst should
respect lwtunnel configuration on the nexthop as well.

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 net/ipv4/route.c | 39 ++-
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7da886e..44f5403 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1596,6 +1596,19 @@ static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr)
	spin_unlock_bh(&fnhe_lock);
 }
 
+static void set_lwt_redirect(struct rtable *rth)
+{
+   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+   rth->dst.lwtstate->orig_output = rth->dst.output;
+   rth->dst.output = lwtunnel_output;
+   }
+
+   if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+   rth->dst.lwtstate->orig_input = rth->dst.input;
+   rth->dst.input = lwtunnel_input;
+   }
+}
+
 /* called in rcu_read_lock() section */
 static int __mkroute_input(struct sk_buff *skb,
   const struct fib_result *res,
@@ -1685,14 +1698,7 @@ static int __mkroute_input(struct sk_buff *skb,
rth->dst.input = ip_forward;
 
rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
-   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
-   rth->dst.lwtstate->orig_output = rth->dst.output;
-   rth->dst.output = lwtunnel_output;
-   }
-   if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
-   rth->dst.lwtstate->orig_input = rth->dst.input;
-   rth->dst.input = lwtunnel_input;
-   }
+   set_lwt_redirect(rth);
	skb_dst_set(skb, &rth->dst);
 out:
err = 0;
@@ -1919,8 +1925,18 @@ out: return err;
rth->dst.error= -err;
rth->rt_flags   &= ~RTCF_LOCAL;
}
+
if (do_cache) {
-   if (unlikely(!rt_cache_route(&FIB_RES_NH(res), rth))) {
+   struct fib_nh *nh = &FIB_RES_NH(res);
+
+   rth->dst.lwtstate = lwtstate_get(nh->nh_lwtstate);
+   if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+   WARN_ON(rth->dst.input == lwtunnel_input);
+   rth->dst.lwtstate->orig_input = rth->dst.input;
+   rth->dst.input = lwtunnel_input;
+   }
+
+   if (unlikely(!rt_cache_route(nh, rth))) {
rth->dst.flags |= DST_NOCACHE;
rt_add_uncached_list(rth);
}
@@ -2138,10 +2154,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
}
 
rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
-   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
-   rth->dst.lwtstate->orig_output = rth->dst.output;
-   rth->dst.output = lwtunnel_output;
-   }
+   set_lwt_redirect(rth);
 
return rth;
 }
-- 
2.7.4



[PATCH net-next v2 5/5] lwtunnel: Limit number of recursions on output to 5

2016-10-31 Thread Thomas Graf
Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 net/core/lwtunnel.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 554d901..6363d0b 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -231,6 +231,10 @@ int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b)
 }
 EXPORT_SYMBOL(lwtunnel_cmp_encap);
 
+/* Per CPU recursion counter for dst_output() redirections via LWT */
+#define DST_RECURSION_LIMIT 5
+DEFINE_PER_CPU(int, dst_recursion);
+
 int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
struct dst_entry *dst = skb_dst(skb);
@@ -246,11 +250,19 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
lwtstate->type > LWTUNNEL_ENCAP_MAX)
return 0;
 
+   if (unlikely(__this_cpu_read(dst_recursion) > DST_RECURSION_LIMIT)) {
+   net_crit_ratelimited("lwt: recursion limit reached of redirected dst_output calls\n");
+   return -EFAULT;
+   }
+
ret = -EOPNOTSUPP;
rcu_read_lock();
ops = rcu_dereference(lwtun_encaps[lwtstate->type]);
-   if (likely(ops && ops->output))
+   if (likely(ops && ops->output)) {
+   __this_cpu_inc(dst_recursion);
ret = ops->output(net, sk, skb);
+   __this_cpu_dec(dst_recursion);
+   }
rcu_read_unlock();
 
if (ret == -EOPNOTSUPP)
-- 
2.7.4



[PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation

2016-10-31 Thread Thomas Graf
Register two new BPF prog types BPF_PROG_TYPE_LWT_IN and
BPF_PROG_TYPE_LWT_OUT which are invoked if a route contains a
LWT redirection of type LWTUNNEL_ENCAP_BPF.

The separate program types are required because manipulation of
packet data is only allowed on the output and transmit path as
the subsequent dst_input() call path assumes an IP header
validated by ip_rcv(). The BPF programs will be handed an skb
with the L3 header attached and may return one of the following
return codes:

 BPF_OK - Continue routing as per nexthop
 BPF_DROP - Drop skb and return EPERM
 BPF_REDIRECT - Redirect skb to device as per redirect() helper.
(Only valid on lwtunnel_xmit() hook)

The return codes are binary compatible with their TC_ACT_*
relatives to ease compatibility.

A new helper bpf_skb_push() is added which allows prepending an
L2 header in front of the skb, extending the existing L3 header, or
both. This addresses a wide range of issues:
 - Optimize L2 header construction when L2 information is always
   static to avoid ARP/NDisc lookup.
 - Extend IP header to add additional IP options.
 - Perform simple encapsulation where offload is of no concern.
   (The existing functionality to attach a tunnel key to the skb
    and redirect to a tunnel net_device for offload obviously
    continues to work.)
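As an illustration of the headroom handling described above, here is a small userspace model of the push operation. This is a sketch only — `struct buf` and `buf_push` are invented stand-ins for the skb and the helper, not the kernel implementation:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of an skb: payload sits at data..data+len inside a larger
 * allocation, with spare headroom in front (like skb_headroom()). */
struct buf {
	unsigned char *head;   /* start of allocation */
	unsigned char *data;   /* start of packet data */
	size_t len;            /* bytes of packet data */
	size_t size;           /* total allocation size */
};

/* Reserve len bytes in front of the current data, reallocating when the
 * existing headroom is too small. The reallocation is what invalidates
 * previously derived data pointers, as the helper documentation warns. */
static int buf_push(struct buf *b, size_t len)
{
	if ((size_t)(b->data - b->head) < len) {
		size_t off = b->data - b->head;
		size_t grow = len - off;
		unsigned char *n = malloc(b->size + grow);

		if (!n)
			return -1;
		/* Keep the payload, but give it 'len' bytes of headroom. */
		memcpy(n + off + grow, b->data, b->len);
		free(b->head);
		b->head = n;
		b->data = n + off + grow;
		b->size += grow;
	}
	b->data -= len;
	b->len += len;
	return 0;
}
```

A caller that pushed a 14-byte Ethernet header onto a buffer with only 4 bytes of headroom would see the payload preserved but relocated — which is why any data/data_end check must be redone after the push.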

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 include/linux/filter.h|   2 +-
 include/uapi/linux/bpf.h  |  37 +++-
 include/uapi/linux/lwtunnel.h |  21 ++
 kernel/bpf/verifier.c |  16 +-
 net/Kconfig   |   1 +
 net/core/Makefile |   2 +-
 net/core/filter.c | 148 -
 net/core/lwt_bpf.c| 504 ++
 net/core/lwtunnel.c   |   1 +
 9 files changed, 725 insertions(+), 7 deletions(-)
 create mode 100644 net/core/lwt_bpf.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1f09c52..aad7f81 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -438,7 +438,7 @@ struct xdp_buff {
 };
 
 /* compute the linear packet data range [data, data_end) which
- * will be accessed by cls_bpf and act_bpf programs
+ * will be accessed by cls_bpf, act_bpf and lwt programs
  */
 static inline void bpf_compute_data_end(struct sk_buff *skb)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e2f38e0..c034a2d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,6 +96,9 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
+   BPF_PROG_TYPE_LWT_IN,
+   BPF_PROG_TYPE_LWT_OUT,
+   BPF_PROG_TYPE_LWT_XMIT,
 };
 
 #define BPF_PSEUDO_MAP_FD  1
@@ -383,6 +386,16 @@ union bpf_attr {
  *
  * int bpf_get_numa_node_id()
  * Return: Id of current NUMA node.
+ *
+ * int bpf_skb_push()
+ * Add room to the beginning of the skb and adjust the MAC header offset accordingly.
+ * Extends/reallocates needed skb headroom automatically.
+ * May change skb data pointer and will thus invalidate any check done
+ * for direct packet access.
+ * @skb: pointer to skb
+ * @len: length of header to be pushed in front
+ * @flags: Flags (unused for now)
+ * Return: 0 on success or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -427,7 +440,8 @@ union bpf_attr {
FN(skb_pull_data),  \
FN(csum_update),\
FN(set_hash_invalid),   \
-   FN(get_numa_node_id),
+   FN(get_numa_node_id),   \
+   FN(skb_push),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -511,6 +525,27 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
 };
 
+/* Generic BPF return codes which all BPF program types may support.
+ * The values are binary compatible with their TC_ACT_* counterparts to
+ * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
+ * programs.
+ *
+ * XDP is handled separately, see XDP_*.
+ */
+enum bpf_ret_code {
+   BPF_OK = 0,
+   /* 1 reserved */
+   BPF_DROP = 2,
+   /* 3-6 reserved */
+   BPF_REDIRECT = 7,
+   /* >127 are reserved for prog type specific return codes */
+};
+
+/* LWT specific return codes */
+enum bpf_lwt_ret_code {
+   BPF_LWT_REROUTE = 128,
+};
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index a478fe8..9354d997 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -9,6 +9,7 @@ enum lwtunnel_encap_types {
LWTUNNEL_ENCAP_IP,
LWTUNNEL_ENCAP_ILA,
LWTUNNEL_ENCAP_IP6,
+   LWTUNNEL_

[PATCH net-next v2 1/5] route: Set orig_output when redirecting to lwt on locally generated traffic

2016-10-31 Thread Thomas Graf
orig_output for IPv4 was only set for dsts which hit an input route.
Set it consistently for locally generated traffic as well to allow
lwt to continue the dst_output() path as configured by the nexthop.

Fixes: 2536862311d ("lwt: Add support to redirect dst.input")
Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 net/ipv4/route.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 62d4d90..7da886e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2138,8 +2138,10 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
}
 
rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
-   if (lwtunnel_output_redirect(rth->dst.lwtstate))
+   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+   rth->dst.lwtstate->orig_output = rth->dst.output;
rth->dst.output = lwtunnel_output;
+   }
 
return rth;
 }
-- 
2.7.4



Re: XDP question - how much can BPF change in xdp_buff?

2016-10-31 Thread Thomas Graf
On 10/31/16 at 12:22pm, John Fastabend wrote:
> On 16-10-31 11:57 AM, David Miller wrote:
> > My understanding is that the eBPF program would be responsible
> > for updating the checksum if it mangles the packet in such a
> > way that such a fixup would be required.
> > 
> 
> For XDP we will probably need to add support for at minimum the
> following helpers,
> 
>   bpf_l3_csum_replace
>   bpf_l4_csum_replace
csum_diff

We definitely want some visibility feature that can be enabled for
troubleshooting and debugging which verifies the checksum in SW
after the bpf program is done. Otherwise, if an XDP BPF program
miscalculates the checksum, there is no way to figure out whether
the checksum is off without attaching another system to capture packets.

Speaking from experience, getting the checksum right is one of the
time sinks when developing more complex BPF programs.
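For reference, the arithmetic that the csum_replace family of helpers builds on is the incremental update from RFC 1624: HC' = ~(~HC + ~m + m'). A userspace sketch (not kernel code) that cross-checks the incremental form against a full recomputation — exactly the kind of software verification proposed above:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Standard ones'-complement checksum over a buffer (RFC 1071). */
static uint16_t csum(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t sum = 0;

	while (len > 1) {
		sum += (uint32_t)p[0] << 8 | p[1];
		p += 2;
		len -= 2;
	}
	if (len)
		sum += (uint32_t)p[0] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ~sum & 0xffff;
}

/* Incremental update per RFC 1624: HC' = ~(~HC + ~m + m'),
 * where 'old' is the 16-bit word being replaced by 'new'. */
static uint16_t csum_replace16(uint16_t check, uint16_t old, uint16_t new)
{
	uint32_t sum = (uint16_t)~check;

	sum += (uint16_t)~old;
	sum += new;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ~sum & 0xffff;
}
```

Comparing the incremental result against a recomputation over the modified buffer is a cheap way to catch exactly the miscalculations described above before a program ever hits the wire.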


Re: [PATCH v2 net-next 1/5] bpf: Refactor cgroups code in prep for new type

2016-10-31 Thread Thomas Graf
On 10/31/16 at 06:16pm, Daniel Mack wrote:
> On 10/31/2016 06:05 PM, David Ahern wrote:
> > On 10/31/16 11:00 AM, Daniel Mack wrote:
> >> Yeah, I'm confused too. I changed that name in my v7 from 
> >> BPF_PROG_TYPE_CGROUP_SOCK to BPF_PROG_TYPE_CGROUP_SKB on David's
> >> (Ahern) request. Why is it now renamed again?
> > 
> > Thomas pushed back on adding another program type in favor of using
> > subtypes. So this makes the program type generic to CGROUP and patch
> > 2 in this v2 set added Mickaël's subtype patch with the socket
> > mangling done that way in patch 3.
> > 
> 
> Fine for me. I can change it around again.

I would like to hear from Daniel B and Alexei as well. We need to
decide whether to use subtypes consistently and treat prog types as
something more high level or whether to bluntly introduce a new prog
type for every distinct set of verifier limits. I will change lwt_bpf
as well accordingly.


Re: [PATCH v2 net-next 0/5] Add bpf support to set sk_bound_dev_if

2016-10-31 Thread Thomas Graf
On 10/31/16 at 11:16am, David Ahern wrote:
> On 10/31/16 11:01 AM, David Miller wrote:
> > Also, any reason why you don't allow the cgroup bpf sk filter to return
> > an error code so that the sock creation could be cancelled if the eBPF
> > program desires that?  It could be useful, I suppose.
> 
> My first draft at this feature had that but I removed it for simplicity now. 
> Can certainly add it back.

We're trying to standardize on common return codes for all program
types. The lwt bpf series defines BPF_ codes which are compatible
with TC_ACT_* values to make lwt_bpf and cls_bpf compatible. Would
be great to use the same return codes and implement the ones that
make sense.
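The binary compatibility between the proposed BPF_* codes and TC_ACT_* can be pinned down mechanically. The values below are mirrored locally from include/uapi/linux/pkt_cls.h and the lwt-bpf series so this compiles without kernel headers; it is an illustration, not kernel code:

```c
#include <assert.h>

/* Mirrored from include/uapi/linux/pkt_cls.h */
enum { TC_ACT_OK = 0, TC_ACT_SHOT = 2, TC_ACT_REDIRECT = 7 };

/* Mirrored from the generic BPF return codes proposed in the
 * lwt-bpf series. */
enum { BPF_OK = 0, BPF_DROP = 2, BPF_REDIRECT = 7 };

/* A program returning TC_ACT_SHOT under cls_bpf and one returning
 * BPF_DROP under lwt-bpf thus encode the identical value, so a
 * program body can be shared between the two attach points. */
```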


Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-31 Thread Thomas Graf
On 10/31/16 at 09:07am, Tom Herbert wrote:
> I guess this leads to a more general question I have about the effects
> of allowing userspace to insert code in the kernel that modifies
> packets. If we allow BPF programs to arbitrarily modify packets in
> LWT, how do we ensure that there are no insidious effects later in the
> path? For instance,  what someone uses BPF to convert an IPv6 packet
> to IPv4, or maybe convert packet to something that isn't even IP, or
> what if someone just decides to overwrite every byte in a packet with
> 0xff?

This is why modifying packets is not allowed on input at all as it
would invalidate the IP parsing that has already been done.

Writing is allowed for dst_output() on the basis that it is the
equivalent of a raw socket with header inclusion. If you look at
rawv6_send_hdrinc(), it does not perform any validation and calls into
dst_output() directly. I agree though that this must be made water
proof.

Pushing additional headers is only allowed at xmit; this is the
equivalent of LWT MPLS.

> Are these thing allowed, and if so what is the effect? I would
> assume a policy that these can't cause any insidious effects to
> unrelated traffic or the rest of the system, in particular such things
> should not cause the  kernel to crash (based on the principle that
> user space code should never cause kernel to crash). I think XDP might

Agreed. Although it's already possible to hook a kernel module at LWT
or Netfilter to do arbitrary packet modifications, BPF must be held
at a higher standard even in privileged mode.


Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-31 Thread Thomas Graf
On 10/31/16 at 07:17am, Tom Herbert wrote:
> On Mon, Oct 31, 2016 at 5:59 AM, Thomas Graf <tg...@suug.ch> wrote:
> > Noticed while implementing this: How does ILA ensure that dst_output()
> > is not invoked in a circular manner?
> >
> > dstA->output() -> dstB->output() -> dstA->output() -> ...
> 
> It doesn't. We'll need to add a check for that. Maybe the rule should
> be that an skbuff is only allowed to hit one LWT route?

I'll add a per cpu variable implementing a recursion limit for
dst_output() which callers that can possibly recurse may use. ILA can
use that as well.
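The guard can be modeled in userspace with a thread-local counter standing in for the per-CPU variable (no preemption concerns in this toy). This is a sketch; `output_a`/`output_b` are invented stand-ins for two dsts redirecting to each other:

```c
#include <assert.h>

#define DST_RECURSION_LIMIT 5

/* _Thread_local plays the role of the per-CPU counter here. */
static _Thread_local int dst_recursion;

static int calls;   /* how many handler invocations happened */

/* Two output handlers that redirect to each other, modeling
 * dstA->output() -> dstB->output() -> dstA->output() -> ... */
static int output_b(void);

static int output_a(void)
{
	int ret;

	if (dst_recursion > DST_RECURSION_LIMIT)
		return -1;            /* loop detected, drop */
	calls++;
	dst_recursion++;
	ret = output_b();
	dst_recursion--;
	return ret;
}

static int output_b(void)
{
	int ret;

	if (dst_recursion > DST_RECURSION_LIMIT)
		return -1;
	calls++;
	dst_recursion++;
	ret = output_a();
	dst_recursion--;
	return ret;
}
```

With a limit of 5, the cycle is cut after a bounded number of hops and the counter unwinds cleanly, so an unrelated packet processed afterwards on the same CPU starts from zero.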

> Another scenario to consider: Suppose someone is doing protocol
> translation like in RFC7915. This is one of operations we'd need with
> ILA or GRE to implement an IPv4 overlay network over IPv6. Would this
> be allowed/supported in LWT BPF?

In lwtunnel_xmit() yes, input and output would not support this right
now. It will need some logic as the orig_input and orig_output would
obviously be expecting the same protocol. This is the reason why the
xmit prog type currently has a wider set of allowed helpers.


Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-31 Thread Thomas Graf
On 10/30/16 at 06:28pm, Tom Herbert wrote:
> Right, that's why we rely on a dst cache. Any use of LWT that
> encapsulates or tunnels to a fixed destination (ILA, VXLAN, IPIP,
> etc.) would want to use the dst cache optimization to avoid the second
> lookup. The ILA LWT code used to call orig output and that worked as
> long as we could set the default router as the gateway "via". It was
> something we were able to deploy, but not a general solution.
> Integrating properly with routing gives a much better solution IMO.
> Note that David Lebrun's latest LWT Segment Routing patch does the
> second lookup with the dst cache to try to avoid it.

Noticed while implementing this: How does ILA ensure that dst_output()
is not invoked in a circular manner?

dstA->output() -> dstB->output() -> dstA->output() -> ...


Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-31 Thread Thomas Graf
On 10/30/16 at 06:28pm, Tom Herbert wrote:
> On Sun, Oct 30, 2016 at 2:47 PM, Thomas Graf <tg...@suug.ch> wrote:
> > Instead of building complex logic, we can allow the program to return
> > a code to indicate when to perform another route lookup just as we do
> > for the redirect case. Just because the destination address has
> > changed may not require another lookup in all cases. A typical example
> > would be a program rewriting addresses for the default route to other
> > address which are always handled by the default route as well. An
> > unconditional lookup would hurt performance in many cases.
> 
> Right, that's why we rely on a dst cache. Any use of LWT that
> encapsulates or tunnels to a fixed destination (ILA, VXLAN, IPIP,
> etc.) would want to use the dst cache optimization to avoid the second
> lookup. The ILA LWT code used to call orig output and that worked as
> long as we could set the default router as the gateway "via". It was
> something we were able to deploy, but not a general solution.
> Integrating properly with routing gives a much better solution IMO.
> Note that David Lebrun's latest LWT Segment Routing patch does the
> second lookup with the dst cache to try to avoid it.

Yes, I saw both ILA and SR dst_cache. I was planning on addressing
the conditional reroute in a second step but it looks fairly simple
actually so I'm fine adding this in a v2 based on a return code. I
will limit lwt-bpf to AF_INET && AF_INET6 though.


Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-30 Thread Thomas Graf
On 10/30/16 at 01:34pm, Tom Herbert wrote:
> On Sun, Oct 30, 2016 at 4:58 AM, Thomas Graf <tg...@suug.ch> wrote:
> > +   if (unlikely(!dst->lwtstate->orig_output)) {
> > +   WARN_ONCE(1, "orig_output not set on dst for prog %s\n",
> > + bpf->out.name);
> > +   kfree_skb(skb);
> > +   return -EINVAL;
> > +   }
> > +
> > +   return dst->lwtstate->orig_output(net, sk, skb);
> 
> The BPF program may have changed the destination address so continuing
> with original route in skb may not be appropriate here. This was fixed
> in ila_lwt by calling ip6_route_output and we were able to dst cache
> facility to cache the route to avoid cost of looking it up on every
> packet. Since the kernel  has no insight into what the BPF program
> does to the packet I'd suggest 1) checking if destination address
> changed by BPF and if it did then call route_output to get new route
> 2) If the LWT destination is a host route then try to keep a dst
> cache. This would entail checking destination address on return that
> it is the same one as kept in the dst cache.

Instead of building complex logic, we can allow the program to return
a code to indicate when to perform another route lookup just as we do
for the redirect case. Just because the destination address has
changed may not require another lookup in all cases. A typical example
would be a program rewriting addresses for the default route to other
address which are always handled by the default route as well. An
unconditional lookup would hurt performance in many cases.
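The return-code-driven behavior described here can be sketched as a dispatch. This is a userspace toy with stub functions, not the kernel code; BPF_LWT_REROUTE is the value later proposed in the v2 series:

```c
#include <assert.h>

enum { BPF_OK = 0, BPF_DROP = 2, BPF_LWT_REROUTE = 128 };

static int orig_output_calls, reroute_calls;

/* Stubs standing in for the real paths. */
static int orig_output(void)        { orig_output_calls++; return 0; }
static int reroute_and_output(void) { reroute_calls++;     return 0; }

/* Dispatch on the program's return code: a rewrite that stays on the
 * same route continues via orig_output() with no extra lookup; only
 * BPF_LWT_REROUTE pays for a second fib lookup. */
static int lwt_output(int prog_ret)
{
	switch (prog_ret) {
	case BPF_OK:
		return orig_output();
	case BPF_LWT_REROUTE:
		return reroute_and_output();
	case BPF_DROP:
	default:
		return -1;   /* EPERM-like drop */
	}
}
```

The default-route rewrite example above would simply return BPF_OK and never trigger the second lookup.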


[PATCH net-next 1/4] route: Set orig_output when redirecting to lwt on locally generated traffic

2016-10-30 Thread Thomas Graf
orig_output for IPv4 was only set for dsts which hit an input route.
Set it consistently for locally generated traffic as well to allow
lwt to continue the dst_output() path as configured by the nexthop.

Fixes: 2536862311d ("lwt: Add support to redirect dst.input")
Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 net/ipv4/route.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 62d4d90..7da886e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2138,8 +2138,10 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
}
 
rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
-   if (lwtunnel_output_redirect(rth->dst.lwtstate))
+   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+   rth->dst.lwtstate->orig_output = rth->dst.output;
rth->dst.output = lwtunnel_output;
+   }
 
return rth;
 }
-- 
2.7.4



[PATCH net-next 2/4] route: Set lwtstate for local traffic and cached input dsts

2016-10-30 Thread Thomas Graf
A route on the output path hitting a RTN_LOCAL route will keep the dst
associated on its way through the loopback device. On the receive path,
the dst_input() call will thus invoke the input handler of the route
created in the output path. Thus, lwt redirection for input must be done
for dsts allocated in the output path as well.

Also, if a route is cached in the input path, the allocated dst should
respect lwtunnel configuration on the nexthop as well.

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 net/ipv4/route.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7da886e..44f5403 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1596,6 +1596,19 @@ static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr)
	spin_unlock_bh(&fnhe_lock);
 }
 
+static void set_lwt_redirect(struct rtable *rth)
+{
+   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+   rth->dst.lwtstate->orig_output = rth->dst.output;
+   rth->dst.output = lwtunnel_output;
+   }
+
+   if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+   rth->dst.lwtstate->orig_input = rth->dst.input;
+   rth->dst.input = lwtunnel_input;
+   }
+}
+
 /* called in rcu_read_lock() section */
 static int __mkroute_input(struct sk_buff *skb,
   const struct fib_result *res,
@@ -1685,14 +1698,7 @@ static int __mkroute_input(struct sk_buff *skb,
rth->dst.input = ip_forward;
 
rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
-   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
-   rth->dst.lwtstate->orig_output = rth->dst.output;
-   rth->dst.output = lwtunnel_output;
-   }
-   if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
-   rth->dst.lwtstate->orig_input = rth->dst.input;
-   rth->dst.input = lwtunnel_input;
-   }
+   set_lwt_redirect(rth);
	skb_dst_set(skb, &rth->dst);
 out:
err = 0;
@@ -1919,8 +1925,18 @@ out: return err;
rth->dst.error= -err;
rth->rt_flags   &= ~RTCF_LOCAL;
}
+
if (do_cache) {
-   if (unlikely(!rt_cache_route(&FIB_RES_NH(res), rth))) {
+   struct fib_nh *nh = &FIB_RES_NH(res);
+
+   rth->dst.lwtstate = lwtstate_get(nh->nh_lwtstate);
+   if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+   WARN_ON(rth->dst.input == lwtunnel_input);
+   rth->dst.lwtstate->orig_input = rth->dst.input;
+   rth->dst.input = lwtunnel_input;
+   }
+
+   if (unlikely(!rt_cache_route(nh, rth))) {
rth->dst.flags |= DST_NOCACHE;
rt_add_uncached_list(rth);
}
@@ -2138,10 +2154,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
}
 
rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
-   if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
-   rth->dst.lwtstate->orig_output = rth->dst.output;
-   rth->dst.output = lwtunnel_output;
-   }
+   set_lwt_redirect(rth);
 
return rth;
 }
-- 
2.7.4



[PATCH net-next 4/4] bpf: Add samples for LWT-BPF

2016-10-30 Thread Thomas Graf
This adds a set of samples demonstrating the use of lwt-bpf combined
with a shell script which allows running the samples in the form of
a basic selftest.

The samples include:
 - Allowing all packets
 - Dropping all packets
 - Printing context information
 - Access packet data
 - IPv4 daddr rewrite in dst_output()
 - L2 MAC header push + redirect in lwt xmit

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 samples/bpf/bpf_helpers.h   |   4 +
 samples/bpf/lwt_bpf.c   | 210 +++
 samples/bpf/test_lwt_bpf.sh | 337 
 3 files changed, 551 insertions(+)
 create mode 100644 samples/bpf/lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 90f44bd..f34e417 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -80,6 +80,8 @@ struct bpf_map_def {
unsigned int map_flags;
 };
 
+static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
+   (void *) BPF_FUNC_skb_load_bytes;
 static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from, int len, int flags) =
(void *) BPF_FUNC_skb_store_bytes;
 static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int 
flags) =
@@ -88,6 +90,8 @@ static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flag
(void *) BPF_FUNC_l4_csum_replace;
 static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
(void *) BPF_FUNC_skb_under_cgroup;
+static int (*bpf_skb_push)(void *, int len, int flags) =
+   (void *) BPF_FUNC_skb_push;
 
 #if defined(__x86_64__)
 
diff --git a/samples/bpf/lwt_bpf.c b/samples/bpf/lwt_bpf.c
new file mode 100644
index 000..05be6ac
--- /dev/null
+++ b/samples/bpf/lwt_bpf.c
@@ -0,0 +1,210 @@
+/* Copyright (c) 2016 Thomas Graf <tg...@tgraf.ch>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+#include 
+
+# define printk(fmt, ...)  \
+   ({  \
+   char ____fmt[] = fmt;   \
+   bpf_trace_printk(____fmt, sizeof(____fmt),  \
+##__VA_ARGS__);\
+   })
+
+#define CB_MAGIC 1234
+
+/* Let all packets pass */
+SEC("nop")
+int do_nop(struct __sk_buff *skb)
+{
+   return BPF_OK;
+}
+
+/* Print some context information per packet to tracing buffer.
+ */
+SEC("ctx_test")
+int do_ctx_test(struct __sk_buff *skb)
+{
+   skb->cb[0] = CB_MAGIC;
+   printk("len %d hash %d protocol %d\n", skb->len, skb->hash,
+  skb->protocol);
+   printk("cb %d ingress_ifindex %d ifindex %d\n", skb->cb[0],
+  skb->ingress_ifindex, skb->ifindex);
+
+   return BPF_OK;
+}
+
+/* Print content of skb->cb[] to tracing buffer */
+SEC("print_cb")
+int do_print_cb(struct __sk_buff *skb)
+{
+   printk("cb0: %x cb1: %x cb2: %x\n", skb->cb[0], skb->cb[1],
+  skb->cb[2]);
+   printk("cb3: %x cb4: %x\n", skb->cb[3], skb->cb[4]);
+
+   return BPF_OK;
+}
+
+/* Print source and destination IPv4 address to tracing buffer */
+SEC("data_test")
+int do_data_test(struct __sk_buff *skb)
+{
+   void *data = (void *)(long)skb->data;
+   void *data_end = (void *)(long)skb->data_end;
+   struct iphdr *iph = data;
+
+   if (data + sizeof(*iph) > data_end) {
+   printk("packet truncated\n");
+   return BPF_DROP;
+   }
+
+   printk("src: %x dst: %x\n", iph->saddr, iph->daddr);
+
+   return BPF_OK;
+}
+
+#define IP_CSUM_OFF offsetof(struct iphdr, check)
+#define IP_DST_OFF offsetof(struct iphdr, daddr)
+#define IP_SRC_OFF offsetof(struct iphdr, saddr)
+#define IP_PROTO_OFF offsetof(struct iphdr, protocol)
+#define TCP_CSUM_OFF offsetof(struct tcphdr, check)
+#define UDP_CSUM_OFF offsetof(struct udphdr, check)
+#define IS_PSEUDO 0x10
+
+static inline int rewrite(struct __sk_buff *skb, uint32_t old_ip,
+ uint32_t new_ip, int rw_daddr)
+{
+   int ret, off = 0, flags = IS_PSEUDO;
+   uint8_t proto;
+
+   ret = bpf_skb_load_bytes(skb, IP_PROTO_OFF, &proto, 1);
+ 

[PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation

2016-10-30 Thread Thomas Graf
Register two new BPF prog types BPF_PROG_TYPE_LWT_IN and
BPF_PROG_TYPE_LWT_OUT which are invoked if a route contains a
LWT redirection of type LWTUNNEL_ENCAP_BPF.

The separate program types are required because manipulation of
packet data is only allowed on the output and transmit path as
the subsequent dst_input() call path assumes an IP header
validated by ip_rcv(). The BPF programs will be handed an skb
with the L3 header attached and may return one of the following
return codes:

 BPF_OK - Continue routing as per nexthop
 BPF_DROP - Drop skb and return EPERM
 BPF_REDIRECT - Redirect skb to device as per redirect() helper.
(Only valid on lwtunnel_xmit() hook)

The return codes are binary compatible with their TC_ACT_*
relatives to ease compatibility.

A new helper bpf_skb_push() is added which allows prepending an
L2 header in front of the skb, extending the existing L3 header, or
both. This addresses a wide range of issues:
 - Optimize L2 header construction when L2 information is always
   static to avoid ARP/NDisc lookup.
 - Extend IP header to add additional IP options.
 - Perform simple encapsulation where offload is of no concern.
   (The existing functionality to attach a tunnel key to the skb
    and redirect to a tunnel net_device for offload obviously
    continues to work.)

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 include/linux/filter.h|   2 +-
 include/uapi/linux/bpf.h  |  31 +++-
 include/uapi/linux/lwtunnel.h |  21 +++
 kernel/bpf/verifier.c |  16 +-
 net/core/Makefile |   2 +-
 net/core/filter.c | 148 -
 net/core/lwt_bpf.c| 365 ++
 net/core/lwtunnel.c   |   1 +
 8 files changed, 579 insertions(+), 7 deletions(-)
 create mode 100644 net/core/lwt_bpf.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1f09c52..aad7f81 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -438,7 +438,7 @@ struct xdp_buff {
 };
 
 /* compute the linear packet data range [data, data_end) which
- * will be accessed by cls_bpf and act_bpf programs
+ * will be accessed by cls_bpf, act_bpf and lwt programs
  */
 static inline void bpf_compute_data_end(struct sk_buff *skb)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e2f38e0..2ebaa3c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,6 +96,9 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
+   BPF_PROG_TYPE_LWT_IN,
+   BPF_PROG_TYPE_LWT_OUT,
+   BPF_PROG_TYPE_LWT_XMIT,
 };
 
 #define BPF_PSEUDO_MAP_FD  1
@@ -383,6 +386,16 @@ union bpf_attr {
  *
  * int bpf_get_numa_node_id()
  * Return: Id of current NUMA node.
+ *
+ * int bpf_skb_push()
+ * Add room to the beginning of the skb and adjust the MAC header offset accordingly.
+ * Extends/reallocates needed skb headroom automatically.
+ * May change skb data pointer and will thus invalidate any check done
+ * for direct packet access.
+ * @skb: pointer to skb
+ * @len: length of header to be pushed in front
+ * @flags: Flags (unused for now)
+ * Return: 0 on success or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -427,7 +440,8 @@ union bpf_attr {
FN(skb_pull_data),  \
FN(csum_update),\
FN(set_hash_invalid),   \
-   FN(get_numa_node_id),
+   FN(get_numa_node_id),   \
+   FN(skb_push),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -511,6 +525,21 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
 };
 
+/* Generic BPF return codes which all BPF program types may support.
+ * The values are binary compatible with their TC_ACT_* counterparts to
+ * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
+ * programs.
+ *
+ * XDP is handled separately, see XDP_*.
+ */
+enum bpf_ret_code {
+   BPF_OK = 0,
+   /* 1 reserved */
+   BPF_DROP = 2,
+   /* 3-6 reserved */
+   BPF_REDIRECT = 7,
+};
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index a478fe8..9354d997 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -9,6 +9,7 @@ enum lwtunnel_encap_types {
LWTUNNEL_ENCAP_IP,
LWTUNNEL_ENCAP_ILA,
LWTUNNEL_ENCAP_IP6,
+   LWTUNNEL_ENCAP_BPF,
__LWTUNNEL_ENCAP_MAX,
 };
 
@@ -42,4 +43,24 @@ enum lwtunnel_ip6_t {
 
 #define LWTUNNEL_IP6_MAX (__LWTUNNEL_IP6_MAX - 1)
 
+enum {
+   LWT_BPF_PROG_UNSPEC,
+   LWT_BPF_P

[PATCH net-next 0/4] BPF for lightweight tunnel encapsulation

2016-10-30 Thread Thomas Graf
This series implements BPF program invocation from dst entries via the
lightweight tunnels infrastructure. The BPF program can be attached to
lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
skb as context. input is read-only, output can write, xmit can write,
push headers, and redirect.

Motivation for this work:
 - Restricting outgoing routes beyond what the route tuple supports
 - Per route accounting beyond realms
 - Fast attachment of L2 headers where header does not require resolving
   L2 addresses
 - ILA like uses cases where L3 addresses are resolved and then routed
   in an async manner
 - Fast encapsulation + redirect. For now limited to use cases where not 
   setting inner and outer offset/protocol is OK.

A couple of samples on how to use it can be found in patch 04.

Thomas Graf (4):
  route: Set orig_output when redirecting to lwt on locally generated
traffic
  route: Set lwtstate for local traffic and cached input dsts
  bpf: BPF for lightweight tunnel encapsulation
  bpf: Add samples for LWT-BPF

 include/linux/filter.h|   2 +-
 include/uapi/linux/bpf.h  |  31 +++-
 include/uapi/linux/lwtunnel.h |  21 +++
 kernel/bpf/verifier.c |  16 +-
 net/core/Makefile |   2 +-
 net/core/filter.c | 148 -
 net/core/lwt_bpf.c| 365 ++
 net/core/lwtunnel.c   |   1 +
 net/ipv4/route.c  |  37 +++--
 samples/bpf/bpf_helpers.h |   4 +
 samples/bpf/lwt_bpf.c | 210 
 samples/bpf/test_lwt_bpf.sh   | 337 ++
 12 files changed, 1156 insertions(+), 18 deletions(-)
 create mode 100644 net/core/lwt_bpf.c
 create mode 100644 samples/bpf/lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

-- 
2.7.4



Re: Let's do P4

2016-10-30 Thread Thomas Graf
On 10/30/16 at 08:44am, Jiri Pirko wrote:
> Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastab...@gmail.com wrote:
> >On 16-10-29 07:49 AM, Jakub Kicinski wrote:
> >> On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
> >>> Hi all.
> >>>
> >>> The network world is divided into 2 general types of hw:
> >>> 1) network ASICs - network specific silicon, containing things like TCAM
> >>>These ASICs are suitable to be programmed by P4.
> >>> 2) network processors - basically a general purpose CPUs
> >>>These processors are suitable to be programmed by eBPF.
> >>>
> >>> I believe that by now, the most people came to a conclusion that it is
> >>> very difficult to handle both types by either P4 or eBPF. And since
> >>> eBPF is part of the kernel, I would like to introduce P4 into kernel
> >>> as well. Here's a plan:
> >>>
> >>> 1) Define P4 intermediate representation
> >>>I cannot imagine loading P4 program (c-like syntax text file) into
> >>>kernel as is. That means that as the first step, we need find some
> >>>intermediate representation. I can imagine someting in a form of AST,
> >>>call it "p4ast". I don't really know how to do this exactly though,
> >>>it's just an idea.
> >>>
> >>>In the end there would be a userspace precompiler for this:
> >>>$ makep4ast example.p4 example.ast
> >> 
> >> Maybe stating the obvious, but IMHO defining the IR is the hardest part.
> >> eBPF *is* the IR, we can compile C, P4 or even JIT Lua to eBPF.  The
> >> AST/IR for switch pipelines should allow for similar flexibility.
> >> Looser coupling would also protect us from changes in spec of the high
> >> level language.

My assumption was that a new IR is defined which is easier to parse than
eBPF which is targeted at execution on a CPU and not intended for pattern
matching. Just looking at how llvm creates different patterns and reorders
instructions, I'm not seeing how eBPF can serve as a general purpose IR
if the objective is to allow fairly flexible generation of the bytecode.
Hence the alternative IR serving as additional metadata complementing the
eBPF program.

> >Jumping in the middle here. You managed to get an entire thread going
> >before I even woke up :)
> >
> >The problem with eBPF as an IR is that in the universe of eBPF IR
> >programs the subset that can be offloaded onto a standard ASIC based
> >hardware (non NPU/FPGA/etc) is so small to be almost meaningless IMO.
> >
> >I tried this for a while and the result is users have to write very
> >targeted eBPF that they "know" will be pattern matched and pushed into
> >an ASIC. It can work but its very fragile. When I did this I ended up
> >with an eBPF generator for deviceX and an eBPF generator for deviceY
> >each with a very specific pattern matching engine in the driver to
> >xlate ebpf-deviceX into its asic. Existing ASICs for example usually
> >support only one pipeline, only one parser (or require moving mountains
> >to change the parser via ucode), only one set of tables, and only one
> >deparser/serializer at the end to build the new packet. Next-gen pieces
> >may have some flexibility on the parser side.
> >
> >There is an interesting resource allocation problem we have that could
> >be solved by p4 or devlink where in we want to pre-allocate slices of
> >the TCAM for certain match types. I was planning on writing devlink code
> >for this because its primarily done at initialization once.
> 
> There are 2 resource allocation problems in our hw. One is general
> division of the resources in feature-chunks. That needs to be done
> during the ASIC initialization phase. For that, I also plan to utilize
> devlink API.
> 
> The second one is runtime allocation of tables, and that would be
> handled by p4 just fine.
> 
> 
> >
> >I will note one nice thing about using eBPF however is that you have an
> >easy software emulation path via ebpf engine in kernel.
> >
> >... And merging threads here with Jiri's email ...
> >
> >> If you do p4>ebpf in userspace, you have 2 apis:
> >> 1) to setup sw (in-kernel) p4 datapath, you push bpf.o to kernel
> >> 2) to setup hw p4 datapath, you push program.p4ast to kernel
> >> 
> >> Those are 2 apis. Both wrapped up by TC, but still 2 apis.
> >> 
> >> What I believe is correct is to have one api:
> >> 1) to setup sw (in-kernel) p4 datapath, you push program.p4ast to kernel
> >> 2) to setup hw p4 datapath, you push program.p4ast to kernel

I understand what you mean with two APIs now. You want a single IR
block and divide the SW/HW part in the kernel rather than let llvm or
something else do it.

> >Couple comments around this, first adding yet another IR in the kernel
> >and another JIT engine to map that IR on to eBPF or hardware vendor X
> >doesn't get me excited. Its really much easier to write these as backend
> >objects in LLVM. Not saying it can't be done just saying it is easier
> >in LLVM. Also we already have the LLVM code for P4 to LLVM-IR to eBPF.
> >In the end this would be a 

Re: Let's do P4

2016-10-29 Thread Thomas Graf
On 10/29/16 at 01:28pm, Jiri Pirko wrote:
> Sat, Oct 29, 2016 at 01:15:48PM CEST, tg...@suug.ch wrote:
> >So given the SKIP_SW flag, the in-kernel compiler is optional anyway.
> >Why even risk including a possibly incomplete compiler? Older kernels
> >must be capable of running along newer hardware as long as eBPF can
> >represent the software path. Having to upgrade to latest and greatest
> >kernels is not an option for most people so they would simply have to
> >fall back to SKIP_SW and do it in user space anyway.
> 
> The thing is, if we needo to offload something, it needs to be
> implemented in kernel first. Also, I believe that it is good to have
> in-kernel p4 engine for testing and development purposes.

You lost me now :-) In an earlier email you said:

> It can be the other way around. The p4>ebpf compiler won't be complete
> at the beginning so it is possible that HW could provide more features.
> I don't think it is a problem. With SKIP_SW and SKIP_HW flags in TC,
> the user can set different program to each. I think in real life, that
> would be the most common case anyway.

If you allow SKIP_SW and setting a different program for each to address
this, then how is this any different?

I completely agree that the kernel must be able to provide the same
functionality as the HW, with optional additional capabilities on top, so
the HW can always bail out and punt to SW.

[...]

> >I'm not seeing how either of them is more or less variable. The main
> >difference is whether to require configuring a single cls with both
> >p4ast + bpf or two separate cls, one for each. I'd prefer the single
> >cls approach simply because it is cleaner with regard to offload
> >directly off bpf vs off p4ast.
> 
> That's the bundle that you asked me to forget earlier in this email? :)

I thought you referred to the "store in same object file" as bundle.
I don't really care about that. What I care about is a single way to
configure this that works for both ASIC and non-ASIC hardware.

> >My main point is to not include an IR to eBPF compiler in the kernel
> >and let user space handle this instead.
> 
> If we do it as you describe, we would be using 2 different APIs for
> the offloaded and non-offloaded paths. I don't believe that is acceptable, as
> the offloaded features have to have a kernel implementation. Therefore, I
> believe that p4ast as a kernel API is the only possible option.

Yes, the kernel has the SW implementation in eBPF. I thought that is
what you propose as well. The only difference is whether to generate
that eBPF in kernel or user space.

Not sure I understand the multiple APIs point for offload vs
non-offload. There is a single API: tc. Both models require the user
to provide additional metadata to allow programming ASIC HW: p4ast
IR or whatever we agree on.


Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net

2016-10-29 Thread Thomas Graf
On 10/28/16 at 08:51pm, Shrijeet Mukherjee wrote:
> Generally agree, but SRIOV nics with multiple queues can end up in a bad
> spot if each buffer was 4K right ? I see a specific page pool to be used
> by queues which are enabled for XDP as the easiest solution to swing; that
> way the memory overhead can be restricted to enabled queues and shared
> access issues can be restricted to skb's using that pool, no?

Isn't this clearly a must anyway? I may be missing something
fundamental here so please enlighten me :-)

If we dedicate a page per packet, that could translate to 14M*4K worth
of memory being mapped per second for just a 10G NIC under DoS attack.
How can one protect such a system? Is the assumption that we can always
drop such packets quickly enough before we start dropping randomly due
to memory pressure? If a handshake is required to determine validity
of a packet then that is going to be difficult.


Re: Let's do P4

2016-10-29 Thread Thomas Graf
On 10/29/16 at 12:10pm, Jiri Pirko wrote:
> Sat, Oct 29, 2016 at 11:39:05AM CEST, tg...@suug.ch wrote:
> >On 10/29/16 at 09:53am, Jiri Pirko wrote:
> >> 3) Expose the p4ast in-kernel interpreter to userspace
> >> The easiest way I see is to introduce a new TC classifier, cls_p4.
> >> 
> >>This can work in a very similar way to how cls_bpf does:
> >>$ tc filter add dev eth0 ingress p4 da ast example.ast
> >> 
> >>The TC cls_p4 will be also used for runtime table manipulation.
> >
> >I think this is a great model for the case where HW can provide all
> >of the required capabilities. Thinking about the case where HW
> >provides a subset and SW provides an extended version, i.e. the
> >reality we live in for hosts with ASIC NICs ;-) The hand off point
> >requires some understanding between p4ast and eBPF.
> 
> It can be the other way around. The p4>ebpf compiler won't be complete
> at the beginning so it is possible that HW could provide more features.
> I don't think it is a problem. With SKIP_SW and SKIP_HW flags in TC,
> the user can set different program to each. I think in real life, that
> would be the most common case anyway.

So given the SKIP_SW flag, the in-kernel compiler is optional anyway.
Why even risk including a possibly incomplete compiler? Older kernels
must be capable of running along newer hardware as long as eBPF can
represent the software path. Having to upgrade to latest and greatest
kernels is not an option for most people so they would simply have to
fall back to SKIP_SW and do it in user space anyway.

> >Therefore another idea would be to use cls_bpf directly for this. The
> >p4ast IR could be stored in a separate ELF section in the same object
> >file with an existing eBPF program. The p4ast IR will match the
> 
> I don't like this idea. The kernel API should be clean and simple.
> Bundling p4ast with bpf.o code, so the bpf.o is for kernel and p4ast is
> for driver does not look clean at all. The bundle does not make really
> sense as the programs may do different things for BPF and p4.

I don't care strongly for the bundle. Let's forget about it for now.

> Plus, it's up to user to set this up like he wants. If he wants SW
> processing by BPF and at the same time HW processing by P4, he will use:
> cls_bpf instance with SKIP_HW
> cls_p4 instance with SKIP_SW.
> 
> This is a much more variable, clean and non-confusing approach, I believe.

Non ASIC hardware will want to do offload based on BPF though so your
model would require the user to be aware of what is the preferred
model for his hardware and then either load a cls_bpf only to work
with a Netronome NIC or a cls_p4 + cls_bpf to work with an ASIC NIC,
correct?

I'm not seeing how either of them is more or less variable. The main
difference is whether to require configuring a single cls with both
p4ast + bpf or two separate cls, one for each. I'd prefer the single
cls approach simply because it is cleaner with regard to offload
directly off bpf vs off p4ast.

My main point is to not include an IR to eBPF compiler in the kernel
and let user space handle this instead.


Re: Let's do P4

2016-10-29 Thread Thomas Graf
On 10/29/16 at 09:53am, Jiri Pirko wrote:
> Hi all.
> 
> The network world is divided into 2 general types of hw:
> 1) network ASICs - network specific silicon, containing things like TCAM
>These ASICs are suitable to be programmed by P4.
> 2) network processors - basically a general purpose CPUs
>These processors are suitable to be programmed by eBPF.
> 
> I believe that by now, most people have come to the conclusion that it is
> very difficult to handle both types with either P4 or eBPF. And since
> eBPF is part of the kernel, I would like to introduce P4 into the kernel
> as well. Here's a plan:

For reference, last time I remember we discussed this in the BPF
offload context:
http://www.spinics.net/lists/netdev/msg356178.html

> 1) Define P4 intermediate representation
>I cannot imagine loading a P4 program (C-like syntax text file) into
>the kernel as is. That means that as the first step, we need to find some
>intermediate representation. I can imagine something in the form of an AST,
>call it "p4ast". I don't really know how to do this exactly though,
>it's just an idea.
> 
>In the end there would be a userspace precompiler for this:
>$ makep4ast example.p4 example.ast
> 
> 2) Implement p4ast in-kernel interpreter 
>A kernel module which takes a p4ast and emulates the pipeline.
>This can be implemented from scratch. Or, p4ast could be compiled
>to eBPF. I know there are already a couple of p4>eBPF compilers.
>Not sure how feasible it would be to put this compiler in the kernel.

+1 to using eBPF for emulation. Maybe the compiler doesn't need to be
in the kernel and user space can compile and provide the emulated
pipeline in eBPF directly. See next paragraph for an example where
this could be useful.

> 3) Expose the p4ast in-kernel interpreter to userspace
>The easiest way I see is to introduce a new TC classifier, cls_p4.
> 
>This can work in a very similar way to how cls_bpf does:
>$ tc filter add dev eth0 ingress p4 da ast example.ast
> 
>The TC cls_p4 will be also used for runtime table manipulation.

I think this is a great model for the case where HW can provide all
of the required capabilities. Thinking about the case where HW
provides a subset and SW provides an extended version, i.e. the
reality we live in for hosts with ASIC NICs ;-) The hand off point
requires some understanding between p4ast and eBPF.

Therefore another idea would be to use cls_bpf directly for this. The
p4ast IR could be stored in a separate ELF section in the same object
file with an existing eBPF program. The p4ast IR will match the
eBPF prog if capabilities of HW and SW match. If HW is limited, the
p4ast IR represents what the HW can do plus how to pass it to SW. The
eBPF prog contains whatever logic is required to take over if the HW
either bailed out or handed over deliberately. Then on top, all the
missing pieces of functionality which can only be performed in SW.

tc then loads 1) eBPF maps and prog through bpf() syscall
  2) cls_bpf filter with p4ast IR plus ref to prog and
 maps

> 4) Offload p4ast programs into hardware
>The same p4ast program representation will be passed down
>to drivers via existing TC offloading way - ndo_setup_tc.
>Drivers will then parse it and setup the hardware
>accordingly. Driver will also have possibility to error out
>in case it does not support some requested feature.


[PATCH net-next] bpf: Print function name in addition to function id

2016-10-27 Thread Thomas Graf
The verifier currently prints raw function ids when printing CALL
instructions or when complaining:

5: (85) call 23
unknown func 23

print a meaningful function name instead:

5: (85) call bpf_redirect#23
unknown func bpf_redirect#23

Moves the function documentation to a single comment and renames all
helpers names in the list to conform to the bpf_ prefix notation so
they can be greped in the kernel source.

Signed-off-by: Thomas Graf <tg...@suug.ch>
Acked-by: Daniel Borkmann <dan...@iogearbox.net>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/uapi/linux/bpf.h | 574 ---
 kernel/bpf/verifier.c|  35 ++-
 2 files changed, 316 insertions(+), 293 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 374ef58..e2f38e0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -143,297 +143,301 @@ union bpf_attr {
};
 } __attribute__((aligned(8)));
 
+/* BPF helper function descriptions:
+ *
+ * void *bpf_map_lookup_elem(&map, &key)
+ * Return: Map value or NULL
+ *
+ * int bpf_map_update_elem(&map, &key, &value, flags)
+ * Return: 0 on success or negative error
+ *
+ * int bpf_map_delete_elem(&map, &key)
+ * Return: 0 on success or negative error
+ *
+ * int bpf_probe_read(void *dst, int size, void *src)
+ * Return: 0 on success or negative error
+ *
+ * u64 bpf_ktime_get_ns(void)
+ * Return: current ktime
+ *
+ * int bpf_trace_printk(const char *fmt, int fmt_size, ...)
+ * Return: length of buffer written or negative error
+ *
+ * u32 bpf_prandom_u32(void)
+ * Return: random value
+ *
+ * u32 bpf_raw_smp_processor_id(void)
+ * Return: SMP processor ID
+ *
+ * int bpf_skb_store_bytes(skb, offset, from, len, flags)
+ * store bytes into packet
+ * @skb: pointer to skb
+ * @offset: offset within packet from skb->mac_header
+ * @from: pointer where to copy bytes from
+ * @len: number of bytes to store into packet
+ * @flags: bit 0 - if true, recompute skb->csum
+ * other bits - reserved
+ * Return: 0 on success or negative error
+ *
+ * int bpf_l3_csum_replace(skb, offset, from, to, flags)
+ * recompute IP checksum
+ * @skb: pointer to skb
+ * @offset: offset within packet where IP checksum is located
+ * @from: old value of header field
+ * @to: new value of header field
+ * @flags: bits 0-3 - size of header field
+ * other bits - reserved
+ * Return: 0 on success or negative error
+ *
+ * int bpf_l4_csum_replace(skb, offset, from, to, flags)
+ * recompute TCP/UDP checksum
+ * @skb: pointer to skb
+ * @offset: offset within packet where TCP/UDP checksum is located
+ * @from: old value of header field
+ * @to: new value of header field
+ * @flags: bits 0-3 - size of header field
+ * bit 4 - is pseudo header
+ * other bits - reserved
+ * Return: 0 on success or negative error
+ *
+ * int bpf_tail_call(ctx, prog_array_map, index)
+ * jump into another BPF program
+ * @ctx: context pointer passed to next program
+ * @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
+ * @index: index inside array that selects specific program to run
+ * Return: 0 on success or negative error
+ *
+ * int bpf_clone_redirect(skb, ifindex, flags)
+ * redirect to another netdev
+ * @skb: pointer to skb
+ * @ifindex: ifindex of the net device
+ * @flags: bit 0 - if set, redirect to ingress instead of egress
+ * other bits - reserved
+ * Return: 0 on success or negative error
+ *
+ * u64 bpf_get_current_pid_tgid(void)
+ * Return: current->tgid << 32 | current->pid
+ *
+ * u64 bpf_get_current_uid_gid(void)
+ * Return: current_gid << 32 | current_uid
+ *
+ * int bpf_get_current_comm(char *buf, int size_of_buf)
+ * stores current->comm into buf
+ * Return: 0 on success or negative error
+ *
+ * u32 bpf_get_cgroup_classid(skb)
+ * retrieve a proc's classid
+ * @skb: pointer to skb
+ * Return: classid if != 0
+ *
+ * int bpf_skb_vlan_push(skb, vlan_proto, vlan_tci)
+ * Return: 0 on success or negative error
+ *
+ * int bpf_skb_vlan_pop(skb)
+ * Return: 0 on success or negative error
+ *
+ * int bpf_skb_get_tunnel_key(skb, key, size, flags)
+ * int bpf_skb_set_tunnel_key(skb, key, size, flags)
+ * retrieve or populate tunnel metadata
+ * @skb: pointer to skb
+ * @key: pointer to 'struct bpf_tunnel_key'
+ * @size: size of 'struct bpf_tunnel_key'
+ * @flags: room for future extensions
+ * Return: 0 on success or negative error
+ *
+ * u64 bpf_perf_event_read(&map, index)
+ * Return: Number events read or error code
+ *
+ * int bpf_redirect(ifindex, flags)
+ * redirect to another netdev
+ * @ifindex: ifindex of the net device
+ * @flags: bit 0 - if set, redirect to ingress instead of

Re: [PATCH net-next 2/3] bpf: Add new cgroups prog type to enable sock modifications

2016-10-26 Thread Thomas Graf
On 10/26/16 at 10:08am, David Ahern wrote:
> On 10/26/16 2:41 AM, Thomas Graf wrote:
> > On 10/25/16 at 03:30pm, David Ahern wrote:
> >> @@ -171,6 +177,9 @@ int __cgroup_bpf_run_filter(struct sock *sk,
> >>case BPF_CGROUP_INET_EGRESS:
> >>ret = __cgroup_bpf_run_filter_skb(skb, prog);
> >>break;
> >> +  case BPF_CGROUP_INET_SOCK_CREATE:
> >> +  ret = __cgroup_bpf_run_filter_sk_create(sk, prog);
> >> +  break;
> >>/* make gcc happy else complains about missing enum value */
> >>default:
> >>return 0;
> > 
> > Thinking further ahead of your simple example. Instead of adding yet
> > another prog type for the same hook, we can make this compatible with
> > BPF_CGROUP_INET_EGRESS instead which would then provide a ctx which
> > contains both, the sk and skb.
> > 
> > The ctx concept is very flexible. We can keep the existing dummy skb
> > representation and add sk_ fields which are only valid for BPF at
> > socket layer, e.g skb->sk_bound_dev_if would translate to
> > sk->bound_dev_if.
> > 
> 
> It's an odd user semantic to me to put sock elements into the shadow sk_buff 
> and to reuse BPF_CGROUP_INET_EGRESS. 
> 
> I can drop the _CREATE and just make it BPF_CGROUP_INET_SOCK so it works for 
> any sock modification someone wants to add -- e.g., the port binding use case.

Your specific sk_alloc hook won't have an skb but the majority of
socket BPF programs will want to see both skb and sk. It is not
ideal to introduce a new bpf_prog_type for every minimal difference
in capability if the majority of capabilities and restrictions are
shared. Instead, the subtype approach implemented by the Landlock
series looks much cleaner to me.

I think we should think about how to do this properly for all BPF
programs which operate in socket cgroup context.


Re: [PATCH net-next] netlink: Add nla_memdup() to wrap kmemdup() use on nlattr

2016-10-26 Thread Thomas Graf
On 10/26/16 at 10:59am, Johannes Berg wrote:
> 
> >  /**
> > + * nla_memdup - duplicate attribute memory (kmemdup)
> > + * @src: netlink attribute to duplicate from
> > + * @gfp: GFP mask
> 
> Actually, is there any point in passing a GFP mask? None of the current
> users need it, and it seems fairly unlikely to be needed since this is
> typically used on the netlink input path, where you surely shouldn't
> need GFP_ATOMIC or so?

I'm fine either way. I didn't want to make assumptions which need
later changes. It's not hurting either and the function prototype
is very small.


Re: concurrent rhashtable test failure

2016-10-26 Thread Thomas Graf
On 10/24/16 at 02:11pm, Geert Uytterhoeven wrote:
> Hi Phil,
> 
> On m68k/ARAnyM, test_rhashtable fails with:
> 
> Test failed: thread 0 returned: -4
> 
> (-4 = -EINTR)

The error is returned by kthread_stop(), I suspect we are running into
this:

static int kthread(void *_create)
{
[...]
complete(done);
schedule();

ret = -EINTR;

if (!test_bit(KTHREAD_SHOULD_STOP, &self.flags)) {
__kthread_parkme(&self);
ret = threadfn(data);
}
/* we can't just return, we must preserve "self" on stack */
do_exit(ret);
}


[PATCH net-next] netlink: Add nla_memdup() to wrap kmemdup() use on nlattr

2016-10-26 Thread Thomas Graf
Wrap several common instances of:
kmemdup(nla_data(attr), nla_len(attr), GFP_KERNEL);

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 include/net/netlink.h  | 10 ++
 net/sched/act_bpf.c|  4 +---
 net/sched/cls_bpf.c|  4 +---
 net/wireless/nl80211.c |  3 +--
 4 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 254a0fc..a34f53a 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -1191,6 +1191,16 @@ static inline struct in6_addr nla_get_in6_addr(const struct nlattr *nla)
 }
 
 /**
+ * nla_memdup - duplicate attribute memory (kmemdup)
+ * @src: netlink attribute to duplicate from
+ * @gfp: GFP mask
+ */
+static inline void *nla_memdup(const struct nlattr *src, gfp_t gfp)
+{
+   return kmemdup(nla_data(src), nla_len(src), gfp);
+}
+
+/**
  * nla_nest_start - Start a new level of nested attributes
  * @skb: socket buffer to add attributes to
  * @attrtype: attribute type of container
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 1d39600..9ff06cf 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -226,9 +226,7 @@ static int tcf_bpf_init_from_efd(struct nlattr **tb, struct tcf_bpf_cfg *cfg)
return PTR_ERR(fp);
 
if (tb[TCA_ACT_BPF_NAME]) {
-   name = kmemdup(nla_data(tb[TCA_ACT_BPF_NAME]),
-  nla_len(tb[TCA_ACT_BPF_NAME]),
-  GFP_KERNEL);
+   name = nla_memdup(tb[TCA_ACT_BPF_NAME], GFP_KERNEL);
if (!name) {
bpf_prog_put(fp);
return -ENOMEM;
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index bb1d5a4..52dc85a 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -369,9 +369,7 @@ static int cls_bpf_prog_from_efd(struct nlattr **tb, struct cls_bpf_prog *prog,
return PTR_ERR(fp);
 
if (tb[TCA_BPF_NAME]) {
-   name = kmemdup(nla_data(tb[TCA_BPF_NAME]),
-  nla_len(tb[TCA_BPF_NAME]),
-  GFP_KERNEL);
+   name = nla_memdup(tb[TCA_BPF_NAME], GFP_KERNEL);
if (!name) {
bpf_prog_put(fp);
return -ENOMEM;
diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index c510810..7bfe1e0 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -10638,8 +10638,7 @@ static int handle_nan_filter(struct nlattr *attr_filter,
 
i = 0;
nla_for_each_nested(attr, attr_filter, rem) {
-   filter[i].filter = kmemdup(nla_data(attr), nla_len(attr),
-  GFP_KERNEL);
+   filter[i].filter = nla_memdup(attr, GFP_KERNEL);
filter[i].len = nla_len(attr);
i++;
}
-- 
2.7.4



Re: [PATCH net-next 2/3] bpf: Add new cgroups prog type to enable sock modifications

2016-10-26 Thread Thomas Graf
On 10/25/16 at 03:30pm, David Ahern wrote:
> @@ -171,6 +177,9 @@ int __cgroup_bpf_run_filter(struct sock *sk,
>   case BPF_CGROUP_INET_EGRESS:
>   ret = __cgroup_bpf_run_filter_skb(skb, prog);
>   break;
> + case BPF_CGROUP_INET_SOCK_CREATE:
> + ret = __cgroup_bpf_run_filter_sk_create(sk, prog);
> + break;
>   /* make gcc happy else complains about missing enum value */
>   default:
>   return 0;

Thinking further ahead of your simple example. Instead of adding yet
another prog type for the same hook, we can make this compatible with
BPF_CGROUP_INET_EGRESS instead which would then provide a ctx which
contains both, the sk and skb.

The ctx concept is very flexible. We can keep the existing dummy skb
representation and add sk_ fields which are only valid for BPF at
socket layer, e.g skb->sk_bound_dev_if would translate to
sk->bound_dev_if.


[PATCH net-next] lwt: Remove unused len field

2016-10-21 Thread Thomas Graf
The field is initialized by ILA and MPLS but never used. Remove it.

Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 include/net/lwtunnel.h   | 3 +--
 net/ipv6/ila/ila_lwt.c   | 4 +---
 net/mpls/mpls_iptunnel.c | 5 +
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 67d235f..82e76fe 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -24,11 +24,10 @@ enum {
 struct lwtunnel_state {
__u16   type;
__u16   flags;
+   __u16   headroom;
atomic_trefcnt;
int (*orig_output)(struct net *net, struct sock *sk, struct sk_buff *skb);
int (*orig_input)(struct sk_buff *);
-   int len;
-   __u16   headroom;
struct  rcu_head rcu;
__u8data[0];
 };
diff --git a/net/ipv6/ila/ila_lwt.c b/net/ipv6/ila/ila_lwt.c
index c7a39d0..a7bc54a 100644
--- a/net/ipv6/ila/ila_lwt.c
+++ b/net/ipv6/ila/ila_lwt.c
@@ -122,7 +122,6 @@ static int ila_build_state(struct net_device *dev, struct nlattr *nla,
struct ila_lwt *ilwt;
struct ila_params *p;
struct nlattr *tb[ILA_ATTR_MAX + 1];
-   size_t encap_len = sizeof(*ilwt);
struct lwtunnel_state *newts;
const struct fib6_config *cfg6 = cfg;
struct ila_addr *iaddr;
@@ -155,7 +154,7 @@ static int ila_build_state(struct net_device *dev, struct nlattr *nla,
if (!tb[ILA_ATTR_LOCATOR])
return -EINVAL;
 
-   newts = lwtunnel_state_alloc(encap_len);
+   newts = lwtunnel_state_alloc(sizeof(*ilwt));
if (!newts)
return -ENOMEM;
 
@@ -166,7 +165,6 @@ static int ila_build_state(struct net_device *dev, struct nlattr *nla,
return ret;
}
 
-   newts->len = encap_len;
p = ila_params_lwtunnel(newts);
 
p->locator.v64 = (__force __be64)nla_get_u64(tb[ILA_ATTR_LOCATOR]);
diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
index cf52cf3..2f7ccd9 100644
--- a/net/mpls/mpls_iptunnel.c
+++ b/net/mpls/mpls_iptunnel.c
@@ -133,7 +133,6 @@ static int mpls_build_state(struct net_device *dev, struct nlattr *nla,
struct mpls_iptunnel_encap *tun_encap_info;
struct nlattr *tb[MPLS_IPTUNNEL_MAX + 1];
struct lwtunnel_state *newts;
-   int tun_encap_info_len;
int ret;
 
ret = nla_parse_nested(tb, MPLS_IPTUNNEL_MAX, nla,
@@ -144,13 +143,11 @@ static int mpls_build_state(struct net_device *dev, struct nlattr *nla,
if (!tb[MPLS_IPTUNNEL_DST])
return -EINVAL;
 
-   tun_encap_info_len = sizeof(*tun_encap_info);
 
-   newts = lwtunnel_state_alloc(tun_encap_info_len);
+   newts = lwtunnel_state_alloc(sizeof(*tun_encap_info));
if (!newts)
return -ENOMEM;
 
-   newts->len = tun_encap_info_len;
tun_encap_info = mpls_lwtunnel_encap(newts);
ret = nla_get_labels(tb[MPLS_IPTUNNEL_DST], MAX_NEW_LABELS,
 &tun_encap_info->labels, tun_encap_info->label);
-- 
2.7.4



Re: [PATCH net] ila: Fix tailroom allocation of lwtstate

2016-10-20 Thread Thomas Graf
On 10/20/16 at 11:20am, David Miller wrote:
> This patch only applies to net-next, so I've applied it there.
> 
> Please explain how I should handle 'net' and -stable.

The presence of ila_params_lwtunnel() in 'net' led me to believe that
'net' is affected as well. It is not. Applying to 'net-next' only is
fine.


[PATCH net] ila: Fix tailroom allocation of lwtstate

2016-10-19 Thread Thomas Graf
Tailroom is supposed to be of length sizeof(struct ila_lwt) but
sizeof(struct ila_params) is currently allocated.

This leads to the dst_cache and connected member of ila_lwt being
referenced out of bounds.

struct ila_lwt {
struct ila_params p;
struct dst_cache dst_cache;
u32 connected : 1;
};

Fixes: 65d7ab8de582 ("net: Identifier Locator Addressing module")
Signed-off-by: Thomas Graf <tg...@suug.ch>
Acked-by: Daniel Borkmann <dan...@iogearbox.net>
---
 -stable candidate

 net/ipv6/ila/ila_lwt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ila/ila_lwt.c b/net/ipv6/ila/ila_lwt.c
index 9fafba6..c7a39d0 100644
--- a/net/ipv6/ila/ila_lwt.c
+++ b/net/ipv6/ila/ila_lwt.c
@@ -122,7 +122,7 @@ static int ila_build_state(struct net_device *dev, struct 
nlattr *nla,
struct ila_lwt *ilwt;
struct ila_params *p;
struct nlattr *tb[ILA_ATTR_MAX + 1];
-   size_t encap_len = sizeof(*p);
+   size_t encap_len = sizeof(*ilwt);
struct lwtunnel_state *newts;
const struct fib6_config *cfg6 = cfg;
struct ila_addr *iaddr;
-- 
2.7.4



Re: [RFC v3 06/22] landlock: Add LSM hooks

2016-10-19 Thread Thomas Graf
On 09/14/16 at 09:23am, Mickaël Salaün wrote:
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9aa01d9d3d80..36c3e482239c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -85,6 +85,8 @@ enum bpf_arg_type {
>  
>   ARG_PTR_TO_CTX, /* pointer to context */
>   ARG_ANYTHING,   /* any (initialized) argument is ok */
> +
> + ARG_PTR_TO_STRUCT_FILE, /* pointer to struct file */

This should go into patch 7 I guess?

> +void __init landlock_add_hooks(void)
> +{
> + pr_info("landlock: Becoming ready for sandboxing\n");
> + security_add_hooks(landlock_hooks, ARRAY_SIZE(landlock_hooks));
> +}

Can we add the hooks when we load the first BPF program for a hook? That
would also allow us to not make this conditional on a new config option
which all distros have to enable anyway.

I would really like to see this patch split into the LSM part which
allows running BPF progs at LSM and your specific sandboxing use case
which requires the new BPF helpers, new reg type, etc.


Re: [RFC v3 04/22] bpf: Set register type according to is_valid_access()

2016-10-19 Thread Thomas Graf
On 09/14/16 at 09:23am, Mickaël Salaün wrote:
> This fixes a pointer leak when an unprivileged eBPF program reads a pointer
> value from the context. Even if is_valid_access() returns a pointer
> type, the eBPF verifier replaces it with UNKNOWN_VALUE. The register
> value containing an address is then allowed to leak. Moreover, this
> prevented unprivileged eBPF programs from using functions with (legitimate)
> pointer arguments.
> 
> This bug was not a problem until now because the only unprivileged eBPF
> program allowed is of type BPF_PROG_TYPE_SOCKET_FILTER and all the types
> from its context are UNKNOWN_VALUE.
> 
> Signed-off-by: Mickaël Salaün 
> Fixes: 969bf05eb3ce ("bpf: direct packet access")
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 

Can you post this fix separately? It's valid and needed outside of the
scope of this series.


Re: [RFC v3 05/22] bpf,landlock: Add eBPF program subtype and is_valid_subtype() verifier

2016-10-19 Thread Thomas Graf
On 09/14/16 at 09:23am, Mickaël Salaün wrote:
> @@ -155,6 +163,7 @@ union bpf_attr {
>   __u32   log_size;   /* size of user buffer */
>   __aligned_u64   log_buf;/* user supplied buffer */
>   __u32   kern_version;   /* checked when 
> prog_type=kprobe */
> + union bpf_prog_subtype prog_subtype;/* checked when 
> prog_type=landlock */

The comment seems bogus, this is not landlock specific.


[PATCH net-next] bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers

2016-10-18 Thread Thomas Graf
A BPF program is required to check the return register of a
map_elem_lookup() call before accessing memory. The verifier keeps
track of this by converting the type of the result register from
PTR_TO_MAP_VALUE_OR_NULL to PTR_TO_MAP_VALUE after a conditional
jump ensures safety. This check is currently exclusively performed
for the result register 0.

In the event the compiler reorders instructions, BPF_MOV64_REG
instructions may be moved before the conditional jump which causes
them to keep their type PTR_TO_MAP_VALUE_OR_NULL to which the
verifier objects when the register is accessed:

0: (b7) r1 = 10
1: (7b) *(u64 *)(r10 -8) = r1
2: (bf) r2 = r10
3: (07) r2 += -8
4: (18) r1 = 0x59c0
6: (85) call 1
7: (bf) r4 = r0
8: (15) if r0 == 0x0 goto pc+1
 R0=map_value(ks=8,vs=8) R4=map_value_or_null(ks=8,vs=8) R10=fp
9: (7a) *(u64 *)(r4 +0) = 0
R4 invalid mem access 'map_value_or_null'

This commit extends the verifier to keep track of all identical
PTR_TO_MAP_VALUE_OR_NULL registers after a map_elem_lookup() by
assigning them an ID and then marking them all when the conditional
jump is observed.

Signed-off-by: Thomas Graf <tg...@suug.ch>
Reviewed-by: Josef Bacik <jba...@fb.com>
Acked-by: Daniel Borkmann <dan...@iogearbox.net>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/linux/bpf_verifier.h|  2 +-
 kernel/bpf/verifier.c   | 61 +---
 tools/testing/selftests/bpf/test_verifier.c | 72 +
 3 files changed, 118 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 7035b99..ac5b393 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -23,13 +23,13 @@ struct bpf_reg_state {
 * result in a bad access.
 */
u64 min_value, max_value;
+   u32 id;
union {
/* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE */
s64 imm;
 
/* valid when type == PTR_TO_PACKET* */
struct {
-   u32 id;
u16 off;
u16 range;
};
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 99a7e5b..846d7ce 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -212,9 +212,10 @@ static void print_verifier_state(struct bpf_verifier_state *state)
else if (t == CONST_PTR_TO_MAP || t == PTR_TO_MAP_VALUE ||
 t == PTR_TO_MAP_VALUE_OR_NULL ||
 t == PTR_TO_MAP_VALUE_ADJ)
-   verbose("(ks=%d,vs=%d)",
+   verbose("(ks=%d,vs=%d,id=%u)",
reg->map_ptr->key_size,
-   reg->map_ptr->value_size);
+   reg->map_ptr->value_size,
+   reg->id);
if (reg->min_value != BPF_REGISTER_MIN_RANGE)
verbose(",min_value=%llu",
(unsigned long long)reg->min_value);
@@ -447,6 +448,7 @@ static void mark_reg_unknown_value(struct bpf_reg_state *regs, u32 regno)
 {
BUG_ON(regno >= MAX_BPF_REG);
regs[regno].type = UNKNOWN_VALUE;
+   regs[regno].id = 0;
regs[regno].imm = 0;
 }
 
@@ -1252,6 +1254,7 @@ static int check_call(struct bpf_verifier_env *env, int func_id)
return -EINVAL;
}
regs[BPF_REG_0].map_ptr = meta.map_ptr;
+   regs[BPF_REG_0].id = ++env->id_gen;
} else {
verbose("unknown return type %d of func %d\n",
fn->ret_type, func_id);
@@ -1644,8 +1647,7 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
insn->src_reg);
return -EACCES;
}
-   regs[insn->dst_reg].type = UNKNOWN_VALUE;
-   regs[insn->dst_reg].map_ptr = NULL;
+   mark_reg_unknown_value(regs, insn->dst_reg);
}
} else {
/* case: R = imm
@@ -1907,6 +1909,38 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg,
check_reg_overflow(true_reg);
 }
 
+static void mark_map_reg(struct bpf_reg_state *regs, u32 regno, u32 id,
+enum bpf_reg_type type)
+{
+   struct bpf_reg_state *reg = &regs[regno];
+
+   if (reg->type == PTR_TO_MAP_VALUE_OR_NULL && reg->id == id) {
+   reg->type = type;
+   if (type == UNKNOWN_VALUE)
+   mark_reg_unknown_value(regs, regno);
+   }
+}
+
+/* The logic is similar to find_good_pkt_

Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-09-22 Thread Thomas Graf
On 09/22/16 at 11:21am, Pablo Neira Ayuso wrote:
> I have a hard time to buy this new specific hook, I think we should
> shift focus of this debate, this is my proposal to untangle this:
> 
> You add a net/netfilter/nft_bpf.c expression that allows you to run
> bpf programs from nf_tables. This expression can either run bpf
> programs in a similar fashion to tc+bpf or run the bpf program that
> you have attached to the cgroup.

So for every packet processed, you want to require the user to load
and run an (unJITed) nft program acting as a wrapper to run a JITed
BPF program? What is the benefit of this model compared to what Daniel
is proposing? The hooking point is the same. This only introduces
additional per-packet overhead in the fast path. Am I missing something?


Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-21 Thread Thomas Graf
On 09/21/16 at 11:50am, Tom Herbert wrote:
> 50 lines in one driver is not a big deal, 50 lines in a hundred
> drivers is! I learned this lesson in BQL which was well abstracted out
> to be minimally invasive but we still saw many issues because of the
> peculiarities of different drivers.

You want to enable XDP in a hundred drivers? Are you planning to
deploy ISA NIC based ILA routers? ;-)


Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-09-21 Thread Thomas Graf
On 09/21/16 at 05:45pm, Pablo Neira Ayuso wrote:
> On Tue, Sep 20, 2016 at 06:43:35PM +0200, Daniel Mack wrote:
> > The point is that from an application's perspective, restricting the
> > ability to bind a port and dropping packets that are being sent is a
> > very different thing. Applications will start to behave differently if
> > they can't bind to a port, and that's something we do not want to happen.
> 
> What is exactly the problem? Applications are not checking for return
> value from bind? They should be fixed. If you want to collect
> statistics, I see no reason why you couldn't collect them for every
> EACCESS on each bind() call.

It's not about applications not checking the return value of bind().
Unfortunately, many applications (or the respective libraries they use)
retry on connect() failure but handle bind() errors as a hard failure
and exit. Yes, it's an application or library bug, but these
applications have very specific expectations of how something fails.
Sometimes even going from drop to RST will break applications.

Paranoia speaking: by returning errors where no error was returned
before, undefined behaviour occurs. In Murphy speak: things break.

This is given and we can't fix it from the kernel side. Returning at
system call level has many benefits but it's not always an option.

Adding the late hook does not prevent filtering at socket layer to
also be added. I think we need both.


Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-21 Thread Thomas Graf
On 09/21/16 at 07:19am, Tom Herbert wrote:
> certain design that because of constraints on one kernel interface. As
> a kernel developer I want flexibility on how we design and implement
> things!

Perfectly valid argument. I reviewed your ILA changes and did not
object to them.


> I think there are two questions that this patch set poses for the
> community wrt XDP:
> 
> #1: Should we allow alternate code to run in XDP other than BPF?
> #2: If #1 is true what is the best way to implement that?
> 
> If the answer to #1 is "no" then the answer to #2 is irrelevant. So
> with this RFC I'm hoping we can come to agreement on question #1.

I'm not opposed to running non-BPF code at XDP. I'm against adding
a linked list of hook consumers.

Would anyone require to run XDP-BPF in combination ILA? Or XDP-BPF
in combination with a potential XDP-nftables? We don't know yet I
guess.

Maybe exclusive access to the hook for one consumer as selected by
the user is good enough.

If that is not good enough: BPF (and potentially nftables in the
future) could provide means to perform a selection process where a
helper call can run another XDP prog or return a verdict to trigger
another XDP prog. Definitely more flexible and faster than a linear
list doing if, else if, else if, else if, ...


Re: [PATCH v4 net-next 00/16] tcp: BBR congestion control algorithm

2016-09-21 Thread Thomas Graf
On 09/21/16 at 07:53am, Neal Cardwell wrote:
> On Wed, Sep 21, 2016 at 12:44 AM, David Miller  wrote:
> > From: Neal Cardwell 
> > Date: Mon, 19 Sep 2016 23:39:07 -0400
> >
> >> tcp: BBR congestion control algorithm
> >
> > Series applied, thanks Neal.
> 
> Thanks, David!

Let me just say that this is super awesome. Thank you to everybody who
was involved!


Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-21 Thread Thomas Graf
On 09/20/16 at 04:59pm, Tom Herbert wrote:
> Well, need to measure to ascertain the cost. As for complexity, this
> actually reduces complexity needed for XDP in the drivers which is a
> good thing because that's where most of the support and development
> pain will be.

I'm not objecting to anything that simplifies the process of adding
XDP capability to drivers. You have my full support here.

> I am looking at using this for ILA router. The problem I am hitting is
> that not all packets that we need to translate go through the XDP
> path. Some would go through the kernel path, some through XDP path but

When you say kernel path, what do you mean specifically? One aspect of
XDP I love is that XDP can act as an acceleration option for existing
BPF programs attached to cls_bpf. Support for direct packet read and
write at clsact level have made it straight forward to write programs
which are compatible or at minimum share a lot of common code. They
can share data structures, lookup functionality, etc.

> We can optimize for allowing only one hook, or maybe limit to only
> allowing one hook to be set. In any case this obviously requires a lot
> of performance evaluation, I am hoping to feedback on the design
> first. My question about using a linear list for this was real, do you
> know a better method off hand to implement a call list?

My main concern is that we overload the XDP hook. Instead of making use
of the programmable glue, we put a linked list in front where everybody
can attach a program to.

A possible alternative:
 1. The XDP hook always has single consumer controlled by the user
through Netlink, BPF is one of them. If a user wants to hardcode
the ILA router to that hook, he can do that.

 2. BPF for XDP is extended to allow returning a verdict which results
in something else to be invoked. If user wants to invoke the ILA
router for just some packets, he can do that.

That said, I see so much value in a BPF implementation of ILA at XDP
level with all of the parsing logic and exact semantics remain
flexible without the cost of translating some configuration to a set
of actions.


Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-20 Thread Thomas Graf
On 09/20/16 at 04:18pm, Tom Herbert wrote:
> This allows other use cases than BPF inserting code into the data
> path. This gives XDP potential more utility and more users so that we
> can motivate more driver implementations. For instance, I thinks it's
> totally reasonable if the nftables guys want to insert some of their
> rules to perform early DDOS drop to get the same performance that we
> see in XDP.

Reasonable point with nftables but are any of these users on the table
already and ready to consume non-skbs? It would be a pity to add this
complexity and cost if it is never used.

I don't see how we can ensure performance if we have multiple
subsystems register for the hook each adding their own parsers which
need to be passed through sequentially. Maybe I'm missing something.


Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-20 Thread Thomas Graf
On 09/20/16 at 03:49pm, Tom Herbert wrote:
> On Tue, Sep 20, 2016 at 3:44 PM, Thomas Graf <tg...@suug.ch> wrote:
> > On 09/20/16 at 03:00pm, Tom Herbert wrote:
> >> +static inline int __xdp_hook_run(struct list_head *list_head,
> >> +  struct xdp_buff *xdp)
> >> +{
> >> + struct xdp_hook_ops *elem;
> >> + int ret = XDP_PASS;
> >> +
> >> + list_for_each_entry(elem, list_head, list) {
> >> + ret = elem->hook(elem->priv, xdp);
> >> + if (ret != XDP_PASS)
> >> + break;
> >> + }
> >
> > Walking over a linear list? Really? :-) I thought this was supposed
> > to be fast, no compromises made.
> 
> Can you suggest an alternative?

Single BPF program that encodes whatever logic is required. This is
what BPF is for. If it absolutely has to run two programs in sequence
then it can still do that even though I really don't see much of a
point of doing that in a high performance environment.

I'm not even sure yet I understand full purpose of this yet ;-)


Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-20 Thread Thomas Graf
On 09/20/16 at 03:00pm, Tom Herbert wrote:
> +static inline int __xdp_hook_run(struct list_head *list_head,
> +  struct xdp_buff *xdp)
> +{
> + struct xdp_hook_ops *elem;
> + int ret = XDP_PASS;
> +
> + list_for_each_entry(elem, list_head, list) {
> + ret = elem->hook(elem->priv, xdp);
> + if (ret != XDP_PASS)
> + break;
> + }

Walking over a linear list? Really? :-) I thought this was supposed
to be fast, no compromises made.


Re: [PATCH net-next 5/7] rhashtable: abstract out function to get hash

2016-09-20 Thread Thomas Graf
On 09/20/16 at 05:36pm, Herbert Xu wrote:
> Tom Herbert <t...@herbertland.com> wrote:
> > Split out most of rht_key_hashfn which is calculating the hash into
> > its own function. This way the hash function can be called separately to
> > get the hash value.
> > 
> > Acked-by: Thomas Graf <tg...@suug.ch>
> > Signed-off-by: Tom Herbert <t...@herbertland.com>
> 
> I don't get this one.  You're just using jhash, right? Why not
> call jhash directly instead of rht_get_key_hashfn?

FYI, there is a v2 of this series, just so you don't have to do the
work twice.

I understand this particular patch as an effort not to duplicate
hash function selection such as jhash vs jhash2 based on key_len.


Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support

2016-09-20 Thread Thomas Graf
On 09/20/16 at 09:45am, Alexei Starovoitov wrote:
> that's a great idea. Instead of adding aborted, ring_full, blahblah counters
> everywhere that cost performance, let's add tracepoints that are nops
> when not in use.
> And if all drivers call the same tracepoints it will be even better.
> Monitoring trace_xdp_aborted tracepoint would be generic way to debug
> 'divide by zero'.

Not opposed but I still need an indication to start tracing ;-) A
single counter would be enough though.


Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-09-20 Thread Thomas Graf
On 09/20/16 at 04:29pm, Pablo Neira Ayuso wrote:
> On Mon, Sep 19, 2016 at 10:56:14PM +0200, Daniel Mack wrote:
> [...]
> > Why would we artificially limit the use-cases of this implementation if
> > the way it stands, both filtering and introspection are possible?
> 
> Why should we place infrastructure in the kernel to filter packets so
> late, and why at postrouting btw, when we can do this way earlier
> before any packet is actually sent? No performance impact, no need for
> skbuff allocation and *no cycles wasted to evaluate if every packet is
> wanted or not*.

I won't argue against filtering at socket level, it is very valuable.
A difference is transparency of enforcement. Dropping at networking
level is accounted for in the usual counters and well understood by
admins. "Dropping" at bind socket level would either involve returning
an error to the application (which may change behaviour of the application)
or require a new form of accounting.

I also want to see packet size, selected source address, outgoing device,
and attach tunnel key metadata to the skb based on the cgroup. All of
which are not available at socket level.


Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-09-19 Thread Thomas Graf
On 09/19/16 at 01:13pm, Alexei Starovoitov wrote:
> as far as bpf debuggability/visibility there are various efforts on the way:
> for kernel side:
> - ksym for jit-ed programs
> - hash sum for prog code
> - compact type information for maps and various pretty printers
> - data flow analysis of the programs

We made a generic map tool available here as well:
https://github.com/cilium/cilium/blob/master/bpf/go/map_ctrl.go


Re: [v3 PATCH 1/2] rhashtable: Add rhlist interface

2016-09-19 Thread Thomas Graf
On 09/19/16 at 07:00pm, Herbert Xu wrote:
> The insecure_elasticity setting is an ugly wart brought out by
> users who need to insert duplicate objects (that is, distinct
> objects with identical keys) into the same table.
> 
> In fact, those users have a much bigger problem.  Once those
> duplicate objects are inserted, they don't have an interface to
> find them (unless you count the walker interface which walks
> over the entire table).
> 
> Some users have resorted to doing a manual walk over the hash
> table which is of course broken because they don't handle the
> potential existence of multiple hash tables.  The result is that
> they will break sporadically when they encounter a hash table
> resize/rehash.
> 
> This patch provides a way out for those users, at the expense
> of an extra pointer per object.  Essentially each object is now
> a list of objects carrying the same key.  The hash table will
> only see the lists so nothing changes as far as rhashtable is
> concerned.
> 
> To use this new interface, you need to insert a struct rhlist_head
> into your objects instead of struct rhash_head.  While the hash
> table is unchanged, for type-safety you'll need to use struct
> rhltable instead of struct rhashtable.  All the existing interfaces
> have been duplicated for rhlist, including the hash table walker.
> 
> One missing feature is nulls marking because AFAIK the only potential
> user of it does not need duplicate objects.  Should anyone need
> this it shouldn't be too hard to add.
> 
> Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>

Nice, I like how this simplifies users! Is this suitable for
ILA as well?

Acked-by: Thomas Graf <tg...@suug.ch>



Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-14 Thread Thomas Graf
On 09/14/16 at 12:30pm, Pablo Neira Ayuso wrote:
> On Tue, Sep 13, 2016 at 09:42:19PM -0700, Alexei Starovoitov wrote:
> [...]
> > For us this cgroup+bpf is _not_ for filterting and _not_ for security.
> 
> If your goal is monitoring, then convert these hooks not to allow to
> issue a verdict on the packet, so this becomes innocuous in the same
> fashion as the tracing infrastructure.

Why? How is this at all offensive? We have three parties voicing
interest in this work for both monitoring and security. At least
two specific use cases have been described.  It builds on top of
existing infrastructure and nicely complements other ongoing work.
Why not both?


Re: [PATCH RFC 5/6] net: Generic resolver backend

2016-09-14 Thread Thomas Graf
On 09/09/16 at 04:19pm, Tom Herbert wrote:
> diff --git a/net/core/resolver.c b/net/core/resolver.c
> new file mode 100644
> index 000..61b36c5
> --- /dev/null
> +++ b/net/core/resolver.c
> @@ -0,0 +1,267 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

This include list could be stripped down a bit. ila, lwt, fib, ...

> +
> +static struct net_rslv_ent *net_rslv_new_ent(struct net_rslv *nrslv,
> +  void *key)

Comment above that net_rslv_get_lock() must be held?

> +{
> + struct net_rslv_ent *nrent;
> + int err;
> +
> + nrent = kzalloc(sizeof(*nrent) + nrslv->obj_size, GFP_KERNEL);

GFP_ATOMIC since you typically hold net_rslv_get_lock() spinlock?

> + if (!nrent)
> + return ERR_PTR(-EAGAIN);
> +
> + /* Key is always at beginning of object data */
> + memcpy(nrent->object, key, nrslv->params.key_len);
> +
> + /* Initialize user data */
> + if (nrslv->rslv_init)
> + nrslv->rslv_init(nrslv, nrent);
> +
> + /* Put in hash table */
> + err = rhashtable_lookup_insert_fast(&nrslv->rhash_table,
> + &nrent->node, nrslv->params);
> + if (err)
> + return ERR_PTR(err);
> +
> + if (nrslv->timeout) {
> + /* Schedule timeout for resolver */
> + INIT_DELAYED_WORK(&nrent->timeout_work, net_rslv_delayed_work);

Should this be done before inserting into rhashtable?

> + schedule_delayed_work(&nrent->timeout_work, nrslv->timeout);
> + }
> +
> + nrent->nrslv = nrslv;

Same here.  net_rslv_cancel_all_delayed_work() walking the rhashtable could
see ->nrslv as NULL.



Re: [PATCH RFC 1/6] spinlock: Add library function to allocate spinlock buckets array

2016-09-14 Thread Thomas Graf
On 09/09/16 at 04:19pm, Tom Herbert wrote:
> Add two new library functions, alloc_bucket_spinlocks and
> free_bucket_spinlocks. These are used to allocate and free an array
> of spinlocks that are useful as locks for hash buckets. The interface
> specifies the maximum number of spinlocks in the array as well
> as a CPU multiplier to derive the number of spinlocks to allocate.
> The number allocated is rounded up to a power of two to make
> the array amenable to hash lookup.
> 
> Signed-off-by: Tom Herbert <t...@herbertland.com>

Acked-by: Thomas Graf <tg...@suug.ch>


Re: [PATCH RFC 4/6] rhashtable: abstract out function to get hash

2016-09-14 Thread Thomas Graf
On 09/09/16 at 04:19pm, Tom Herbert wrote:
> Split out most of rht_key_hashfn which is calculating the hash into
> its own function. This way the hash function can be called separately to
> get the hash value.
> 
> Signed-off-by: Tom Herbert <t...@herbertland.com>

Acked-by: Thomas Graf <tg...@suug.ch>


Re: [PATCH RFC 2/6] rhashtable: Call library function alloc_bucket_locks

2016-09-14 Thread Thomas Graf
On 09/09/16 at 04:19pm, Tom Herbert wrote:
> To allocate the array of bucket locks for the hash table we now
> call library function alloc_bucket_spinlocks. This function is
> based on the old alloc_bucket_locks in rhashtable and should
> produce the same effect.
> 
> Signed-off-by: Tom Herbert <t...@herbertland.com>

Acked-by: Thomas Graf <tg...@suug.ch>


Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-14 Thread Thomas Graf
[Sorry for the repost, gmail decided to start sending HTML crap along
 overnight for some reason]

On 09/13/16 at 09:42pm, Alexei Starovoitov wrote:
> On Tue, Sep 13, 2016 at 07:24:08PM +0200, Pablo Neira Ayuso wrote:
> > Then you have to explain me how can anyone else than systemd use this
> > infrastructure?
> 
> Jokes aside. I'm puzzled why systemd is even being mentioned here.
> Here we use tupperware (our internal container management system) that
> is heavily using cgroups and has nothing to do with systemd.

Just confirming that we are planning to use this decoupled from
systemd as well.  I fail to see how this is at all systemd specific.

> For us this cgroup+bpf is _not_ for filterting and _not_ for security.
> We run a ton of tasks in cgroups that launch all sorts of
> things on their own. We need to monitor what they do from networking
> point of view. Therefore bpf programs need to monitor the traffic in
> particular part of cgroup hierarchy. Not globally and no pass/drop decisions.

+10, although filtering/drop is a valid use case, the really strong
use case is definitely introspection at networking level. Statistics,
monitoring, verification of application correctness, etc. 

I don't see why this is at all an either or discussion. If nft wants
cgroups integration similar to this effort, I see no reason why that
should stop this effort.


[PATCH iproute2] tuntap: Add name attribute to usage text

2016-09-08 Thread Thomas Graf
Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 ip/iptuntap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip/iptuntap.c b/ip/iptuntap.c
index 34fb0cf..451f7f0 100644
--- a/ip/iptuntap.c
+++ b/ip/iptuntap.c
@@ -39,7 +39,7 @@ static void usage(void)
 {
	fprintf(stderr, "Usage: ip tuntap { add | del | show | list | lst | help } [ dev PHYS_DEV ]\n");
	fprintf(stderr, "  [ mode { tun | tap } ] [ user USER ] [ group GROUP ]\n");
-	fprintf(stderr, "  [ one_queue ] [ pi ] [ vnet_hdr ] [ multi_queue ]\n");
+	fprintf(stderr, "  [ one_queue ] [ pi ] [ vnet_hdr ] [ multi_queue ] [ name NAME ]\n");
fprintf(stderr, "\n");
fprintf(stderr, "Where: USER  := { STRING | NUMBER }\n");
fprintf(stderr, "   GROUP := { STRING | NUMBER }\n");
-- 
2.5.5



Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Thomas Graf
On 08/19/16 at 06:21pm, Pablo Neira Ayuso wrote:
> On Fri, Aug 19, 2016 at 12:35:14PM +0200, Daniel Mack wrote:
> > Also true. A cgroup can currently only hold one bpf program for each
> > direction, and they are supposed to be set from one controlling instance
> > in the system. However, it is possible to create subcgroups, and install
> > own programs in them, which will then be effective instead of the one in
> > the parent. They will, however, replace each other in runtime behavior,
> > and not be stacked. This is a fundamentally different approach than how
> > nf_tables works of course.
> 
> I see, this looks problematic indeed, thanks for confirming this.

What exactly is problematic? I think the proposed mechanism is very
clean in allowing sub groups to provide the entire program. This
allows for delegation. Different orchestrators can manage different
cgroups. It's different as Daniel stated. I don't see how this is
problematic though.

You brought up multiple tables which reflect the cumulative approach.
This sometimes works but has its issues as well. Users must be aware
of each other and anticipate what rules other users might inject
before or after their own tables. The very existence of firewalld which
aims at democratizing this collaboration proves this point.

So in that sense I would very much like for both models to be made
available to users. nftables+cgroups for a cumulative approach as
well as BPF+cgroups for the delegation approach.  I don't see why the
cgroups based filtering capability should not be made available to both.



Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Thomas Graf
On 08/19/16 at 06:31pm, Pablo Neira Ayuso wrote:
> Why do you need global seccomp policies? The process knows better what
> he needs to place in his sandbox, so attaching this from the process
> itself makes more sense to me... Anyway, this reminds me to selinux.

Two different objectives. The point is to sandbox applications and
restrict their capabilities. It's not the application itself but the task
orchestration system around it that manages the policies.


Re: [PATCH 0/3] rhashtable: Get rid of raw table walkers part 1

2016-08-19 Thread Thomas Graf
On 08/18/16 at 04:48pm, Herbert Xu wrote:
> Dave/Tomas, please keep an eye out for any new patches that try
> to introduce raw table walkers and nack them.

Very nice work Herbert. I think we should move the rht_for_each_* macros
to a private header or lib/rhashtable.c to discourage usage altogether.


Re: [v2 PATCH 3/3] netlink: Use rhashtable walk interface in diag dump

2016-08-19 Thread Thomas Graf
On 08/19/16 at 04:21pm, Herbert Xu wrote:
> This patch converts the diag dumping code to use the rhashtable
> walk code instead of going through rhashtable by hand.  The lock
> nl_table_lock is now only taken while we process the multicast
> list as it's not needed for the rhashtable walk.
> 
> Signed-off-by: Herbert Xu 

LGTM. Curious, what did you use to test this? I've been struggling
to find a good test case.


Re: [PATCH 2/3] MAINTAINERS: Add extra rhashtable maintainer

2016-08-19 Thread Thomas Graf
On 08/18/16 at 04:50pm, Herbert Xu wrote:
> As I'm working actively on rhashtable it helps if people CCed me
> when they work on in.
> 
> Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>

Acked-by: Thomas Graf <tg...@suug.ch>

