Re: [PATCH RFC PoC 0/3] nftables meets bpf

2018-02-20 Thread Daniel Borkmann
Hi Pablo,

On 02/20/2018 11:58 AM, Pablo Neira Ayuso wrote:
> On Mon, Feb 19, 2018 at 08:57:39PM +0100, Daniel Borkmann wrote:
>> On 02/19/2018 05:37 PM, Pablo Neira Ayuso wrote:
>> [...]
>>> * Simplified infrastructure: We don't need the ebpf verifier complexity
>>>   either given we trust the code we generate from the kernel. We don't
>>>   need any complex userspace tooling either, just libnftnl and nft
>>>   userspace binaries.
>>>
>>> * Hardware offload: We can use this to offload rulesets to the only
>>>   smartnic driver that we have in the tree that already implements bpf
>>>   offload, hence, we can reuse this work already in place.
>>
>> In addition to Dave's points, regarding the above two, this will also only
>> work behind the verifier since NIC offloading piggy-backs on the verifier's
>> program analysis to prepare and generate a dev specific JITed BPF
>> prog, so it's not the same as normal host JITs (and there, the cBPF ->
>> eBPF in kernel migration adds a lot of headaches already due to
>> different underlying assumptions coming from the two flavors, even
>> if both are eBPF insns in the end), and given this, offloading will
>> also only work for eBPF and not cBPF.
> 
> We also have a large range of TCAM based hardware offload out there
> that will _not_ work with your BPF HW offload infrastructure. What
> this bpf infrastructure pushes into the kernel is just a blob
> expressing things in a very low-level instruction set: trying to find
> a mapping from that to the typical HW intermediate representations in
> the TCAM based HW offload world would simply be crazy.

Sure, and I think that's fine; possible ways to address this were proposed
at the last netdev conference, e.g. adding hints [0] in a programmable way
as metadata in front of the packet as one option to accelerate. Other than
that, for fully pushing things into hardware, people will get a SmartNIC,
and there are multiple big vendors working on them in that area. In a few
years from now they may well become a commodity in DCs, let's see. Maybe
we'll be programming them similarly to how we program graphics cards
today. :-)

  [0] https://www.netdevconf.org/2.2/session.html?waskiewicz-xdpacceleration-talk

>> There's a lot more the verifier is doing internally, like performing
>> various different program rewrites from the context, for helpers
>> (e.g. inlining), and for internal insn mappings that are not exposed
>> (e.g. in calls), so we definitely need to go through it.
> 
> If we need to call the verifier from the kernel for the code that we
> generate there for this initial stage, that should be not an issue.
> 
> The BPF interface is lacking many of the features and flexibility we
> have in netlink these days, and it only allows for monolithic
> ruleset replacement. This approach also loses internal rule stateful

That depends only on how you partition your program; partial
reconfiguration is definitely possible and done today, for example as
discussed in the LB use case, where packet processing is staged e.g. into
sampling, DDoS mitigation, and an encap + redirect phase, and each of the
components can be replaced atomically at runtime. So there is definitely
flexibility available.

Thanks,
Daniel

> information that we're doing in the packet path when updating the
> ruleset. So it's taking us back to exactly the same mistakes we made
> in iptables back in the 90s, as has been mentioned already.
> 
> So I just wish I can count on your help in this process; we can get
> the best of both worlds by providing a subsystem that allows users
> to configure packet classification through one single interface, no
> matter if the policy representation ends up being in software or in HW
> offloads, either TCAM or smartnic.
> 
> Thanks.
> 

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-20 Thread Daniel Borkmann
On 02/20/2018 11:44 AM, Pablo Neira Ayuso wrote:
> Hi David!
> 
> On Mon, Feb 19, 2018 at 12:22:26PM -0500, David Miller wrote:
> [...]
>> Netfilter's chronic performance differential is why a lot of mindshare
>> was lost to userspace networking technologies.
> 
> Claiming that Netfilter is the reason for the massive adoption of
> userspace networking isn't a fair statement at all.
> 
> Let's talk about performance if this is what you want:
> 
> * Our benchmarks here are delivering a ~x9.5 performance boost for IPv4
>   load balancing from netfilter ingress.
> 
> * ~x2 faster than iptables prerouting when dropping packets at a very
>   early stage in the network datapath - DoS attack scenario - again from
>   the ingress hook.
> 
> * The new flowtable infrastructure that will show up in 4.16 provides
>   a faster forwarding path, measuring ~x2 faster forwarding here, _by
>   simply adding one single rule to your FORWARD chain_. And that's
>   just the initial implementation that got merged upstream, we have
>   room to fly even faster.
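For context, that "one single rule" setup reads roughly as follows as an
nft ruleset (a hedged sketch in the 4.16-era syntax; the table, chain,
flowtable, and device names are placeholders):

```
table inet filter {
	flowtable ft {
		hook ingress priority 0; devices = { eth0, eth1 };
	}
	chain forward {
		type filter hook forward priority 0;
		ip protocol tcp flow offload @ft
	}
}
```

Once a flow has been offloaded by that rule, subsequent packets of the
established flow bypass the classic forwarding path.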
> 
> And that's just the beginning; we have more ongoing work, built
> incrementally on top of what we have, to provide even faster datapaths
> with very simple configurations.
> 
> Note that the numbers above are very similar to what we have seen in
> bpf. Well, to be honest, we're just slightly behind bpf, since the
> benchmarks I have seen for load balancing IPv4 are x10 from XDP, and
> dropping packets is also slightly more than x2 - which happens way
> earlier than ingress; naturally, dropping earlier gives better
> numbers.
> 
> But it's not all about performance... let's have a look at the "iron
> triangle"...
> 
> We keep usability on our radar, that's paramount for us. Netfilter is
> probably so much more widely adopted than tc because of this. The kind

Right, in terms of performance the above is what tc ingress has already
been able to do for a long time, ever since the spinlock could be removed,
which was an important step in that direction. In terms of usability, sure,
it's always a 'fun' topic for a number of classifiers / actions, mostly
from the older days. I think it has improved a bit over time, but at least
speaking of things like cls_bpf, it's trivial to attach an object somewhere
via the tc cmdline.

> of problems that big Silicon datacenters have to face are simply
> different from the millions of devices running Linux out there; there
> are plenty of smart devops out there who accept a small performance
> loss in exchange for keeping things easy to configure and maintain.
> 
> If we want to talk about problems...
> 
> Every project has its own subset of problems. In that sense, anyone that
> has spent time playing with the bpf infrastructure is very much aware of
> all of its usability problems:
> 
> * You have to disable optimizations in llvm, otherwise the verifier
>   gets confused by overly smart compiler optimizations and rejects the
>   code.

That is actually a false claim, which makes me think that you didn't even
give this a try before stating the above. Funny enough, for a very long
period of time in LLVM's BPF back end, when you used optimization levels
other than -O2, clang would bark with an internal error, for example:

  $ clang-3.9 -target bpf -O0 -c foo.c -o /tmp/foo.o
  fatal error: error in backend: Cannot select: 0x5633ae698280: ch,glue =
    BPFISD::CALL 0x5633ae698210, 0x5633ae697e90, Register:i64 %R1,
    Register:i64 %R2, Register:i64 %R3, 0x5633ae698210:1
      0x5633ae697e90: i64,ch = load 0x5633ae6955e0, 0x5633ae694fc0, undef:i64
        0x5633ae694fc0: i64 = BPFISD::Wrapper TargetGlobalAddress:i64 0
  [...]

Whereas -O2 *is* the general recommendation for everyone to use:

  $ clang-3.9 -target bpf -O2 -c foo.c -o /tmp/foo.o
  $

This is fixed in later versions; e.g. in clang-7.0 such back end errors are
gone anyway, fwiw. But in any case, we've been running complex programs at
the -O2 optimization level for several years now just fine. Yes, given we
do push BPF to the limits, we had some corner cases where the verifier had
to be adjusted, but overall the number of such cases has shrunk over time,
which is a natural progression as people use it in various advanced ways.
In fact, using clang with -O2 is the much better choice here, since that is
simply how the majority of people use it. And if you consume BPF via
higher-level front ends, e.g. bcc, ply, or bpftrace to name a few from the
tracing side, then you don't need to care about this at all. (In addition,
there's also continuous effort on the LLVM side to optimize BPF code
generation in various ways.)

> * Very hard to debug the reason why the verifier is rejecting apparently
>   valid code. That results in people playing a strange game of "moving
>   code around up and down".

Please show me your programs and I'm happy to help you out. :-) Yes, in
the earlier days I would concede it might have been hard; over the course
of the last few years, the verifier and 

Re: [PATCH RFC PoC 0/3] nftables meets bpf

2018-02-19 Thread Daniel Borkmann
On 02/19/2018 05:37 PM, Pablo Neira Ayuso wrote:
[...]
> * Simplified infrastructure: We don't need the ebpf verifier complexity
>   either given we trust the code we generate from the kernel. We don't
>   need any complex userspace tooling either, just libnftnl and nft
>   userspace binaries.
> 
> * Hardware offload: We can use this to offload rulesets to the only
>   smartnic driver that we have in the tree that already implements bpf
>   offload, hence, we can reuse this work already in place.

In addition to Dave's points, regarding the above two, this will also only
work behind the verifier since NIC offloading piggy-backs on the verifier's
program analysis to prepare and generate a dev specific JITed BPF prog, so
it's not the same as normal host JITs (and there, the cBPF -> eBPF in kernel
migration adds a lot of headaches already due to different underlying
assumptions coming from the two flavors, even if both are eBPF insns in the
end), and given this, offloading will also only work for eBPF and not cBPF.
There's a lot more the verifier is doing internally, like performing various
different program rewrites from the context, for helpers (e.g. inlining),
and for internal insn mappings that are not exposed (e.g. in calls), so we
definitely need to go through it.

Thanks,
Daniel


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-16 Thread Daniel Borkmann
Hi Florian,

On 02/16/2018 05:14 PM, Florian Westphal wrote:
> Florian Westphal <f...@strlen.de> wrote:
>> Daniel Borkmann <dan...@iogearbox.net> wrote:
>> Several questions spinning at the moment, I will probably come up with
>> more:
> 
> ... and here there are some more ...
> 
> One of the many pain points of xtables design is the assumption of 'used
> only by sysadmin'.
> 
> This has not been true for a very long time, so by now iptables has
> this userspace lock (yes, it's a fugly workaround) to serialize concurrent
> iptables invocations in userspace.
> 
> AFAIU the translate-in-userspace design now brings back the old problem
> of different tools overwriting each other's iptables rules.

Right, so the behavior would need to be adapted to be exactly the same;
given that all requests go into kernel space first via the usual uapis,
I don't think anything would stand in the way of keeping that as is.

> Another question -- am i correct in that each rule manipulation would
> incur a 'recompilation'?  Or are there different mini programs chained
> together?

Right now in the PoC, yes: it basically regenerates the program on the fly
in gen.c when walking the struct bpfilter_ipt_ip's and appends the entries
to the program, but it doesn't have to be that way. There are multiple
options to allow for partial code generation, e.g. via chaining tail call
arrays or eventually directly via BPF-to-BPF calls; a few changes on the
BPF side would be needed, but it can be done. Additionally, various
optimization passes could be performed during the code generation phase,
while keeping the given constraints, in order to speed up getting to a
verdict.

> One of the nftables advantages (since the rule representation in the
> kernel is a black box from the userspace point of view) is that the
> kernel can announce add/delete of rules or elements from nftables sets.
> 
> Any particular reason why translating iptables rather than nftables
> (it should be possible to monitor the nftables changes that are
>  announced by kernel and act on those)?

Yeah, correct, this should be possible as well. We started out with the
iptables part in the demo as the majority of bigger infrastructure projects
still rely heavily on it (e.g. docker, k8s to just name two big ones).
Usually they have their requests to iptables baked directly into their
code, which probably won't change any time soon, so the thought was that
they could benefit from this initially once there is sufficient coverage.

Thanks,
Daniel


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-16 Thread Daniel Borkmann
Hi Florian,

thanks for your feedback! More inline:

On 02/16/2018 03:57 PM, Florian Westphal wrote:
> Daniel Borkmann <dan...@iogearbox.net> wrote:
>> This is a very rough and early proof of concept that implements bpfilter.
> 
> [..]
> 
>> Also, as a benefit from such design, we get BPF JIT compilation on x86_64,
>> arm64, ppc64, sparc64, mips64, s390x and arm32, but also rule offloading
>> into HW for free for Netronome NFP SmartNICs that are already capable of
>> offloading BPF since we can reuse all existing BPF infrastructure as the
>> back end. The user space iptables binary issuing rule addition or dumps was
>> left as-is, thus at some point any binaries against the iptables uapi kernel
>> interface could transparently be supported in such a manner in the long term.
>>
>> As rule translation can potentially become very complex, this is performed
>> entirely in user space. In order to ease deployment, request_module() code
>> is extended to allow user mode helpers to be invoked. Idea is that user mode
>> helpers are built as part of the kernel build and installed as traditional
>> kernel modules with .ko file extension into distro specified location,
>> such that from a distribution point of view, they are no different than
>> regular kernel modules. Thus, allow request_module() logic to load such
>> user mode helper (umh) binaries via:
>>
>>   request_module("foo") ->
>> call_umh("modprobe foo") ->
>>   sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>> call_umh(struct file)
>>
>> Such approach enables kernel to delegate functionality traditionally done
>> by kernel modules into user space processes (either root or !root) and
>> reduces security attack surface of such new code, meaning in case of
>> potential bugs only the umh would crash but not the kernel. Another
>> advantage coming with that would be that bpfilter.ko can be debugged and
>> tested out of user space as well (e.g. opening the possibility to run
>> all clang sanitizers, fuzzers or test suites for checking translation).
> 
> Several questions spinning at the moment, I will probably come up with
> more:

Sure, no problem at all. It's an early RFC, so the purpose is to get a
discussion going on such a potential approach.

> 1. Does this still attach the binary blob to the 'normal' iptables
>hooks?

Yeah, the thought would be to keep the userland tooling functional as is,
without having to recompile binaries; thus this would also need to attach
to the existing hooks in order to keep the semantics working. As an added
benefit, we can also reuse all the rest of the infrastructure to utilize
things like XDP for iptables in the background; there is definitely
flexibility on this side, so users could eventually benefit from this
transparently and wouldn't need to know that 'bpfilter' exists and is
translating in the background. I realize taking this path is a long term
undertaking that we would need to tackle as a community, not just one or
two individuals, if we decide to go in this direction.

> 2. If yes, do you see issues wrt. 'iptables' and 'bpfilter' attached
> programs being different in nature (e.g. changed by different entities)?

There could certainly be multiple options, e.g. a fall-through with state
transfer once a request cannot be handled yet, or a sysctl with iptables
being the default handler and an option to switch to bpfilter to let it
handle requests from then on.

> 3. What happens if the rule can't be translated (yet?)

(See above.)

> 4. Do you plan to reimplement connection tracking in userspace?

One option could be to have a generic, skb-less connection tracker in the
kernel that can be reused from the various hooks it would need to handle;
potentially it could also be offloaded into HW as another benefit coming
out of that.

> If no, how will the bpf program interact with it?
> [ same question applies to ipv6 exthdr traversal, ip defragmentation
> and the like ].

The v6 exthdr traversal could be realized natively in BPF, which should at
the same time make the parsing more robust than having it somewhere inside
a helper in the kernel; bounded loops in BPF would help on that front as
well. Similarly, defrag could be handled by the program, although here we
would need additional infrastructure to queue the packets and then
recirculate them.

> I will probably have a quadrillion of followup questions, sorry :-/

Definitely, please do!

Thanks,
Daniel

>> Also, such architecture makes the kernel/user boundary very precise,
>> meaning requests can be handled and BPF translated in control plane part
>> in user space with its own user memory etc, while minimal data plane
>> bits are in kernel. It would also allow to

[PATCH RFC 3/4] net: initial bpfilter skeleton

2018-02-16 Thread Daniel Borkmann
From: "David S. Miller" 

Signed-off-by: David S. Miller 
Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpfilter.h  |  13 +++
 include/uapi/linux/bpfilter.h | 200 ++
 net/Kconfig   |   2 +
 net/Makefile  |   1 +
 net/bpfilter/Kconfig  |   7 ++
 net/bpfilter/Makefile |   9 ++
 net/bpfilter/bpfilter.c   |  89 +++
 net/bpfilter/bpfilter_mod.h   |  96 
 net/bpfilter/ctor.c   |  80 +
 net/bpfilter/init.c   |  33 +++
 net/bpfilter/sockopt.c| 153 
 net/bpfilter/tables.c |  70 +++
 net/bpfilter/targets.c|  51 +++
 net/bpfilter/tgts.c   |  25 ++
 net/ipv4/Makefile |   2 +
 net/ipv4/bpfilter/Makefile|   2 +
 net/ipv4/bpfilter/sockopt.c   |  49 +++
 net/ipv4/ip_sockglue.c|  17 
 18 files changed, 899 insertions(+)
 create mode 100644 include/linux/bpfilter.h
 create mode 100644 include/uapi/linux/bpfilter.h
 create mode 100644 net/bpfilter/Kconfig
 create mode 100644 net/bpfilter/Makefile
 create mode 100644 net/bpfilter/bpfilter.c
 create mode 100644 net/bpfilter/bpfilter_mod.h
 create mode 100644 net/bpfilter/ctor.c
 create mode 100644 net/bpfilter/init.c
 create mode 100644 net/bpfilter/sockopt.c
 create mode 100644 net/bpfilter/tables.c
 create mode 100644 net/bpfilter/targets.c
 create mode 100644 net/bpfilter/tgts.c
 create mode 100644 net/ipv4/bpfilter/Makefile
 create mode 100644 net/ipv4/bpfilter/sockopt.c

diff --git a/include/linux/bpfilter.h b/include/linux/bpfilter.h
new file mode 100644
index 000..26adad1
--- /dev/null
+++ b/include/linux/bpfilter.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_BPFILTER_H
+#define _LINUX_BPFILTER_H
+
+#include <uapi/linux/bpfilter.h>
+
+struct sock;
+int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char *optval,
+   unsigned int optlen);
+int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char *optval,
+   int *optlen);
+#endif
+
diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h
new file mode 100644
index 000..38d54e9
--- /dev/null
+++ b/include/uapi/linux/bpfilter.h
@@ -0,0 +1,200 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _UAPI_LINUX_BPFILTER_H
+#define _UAPI_LINUX_BPFILTER_H
+
+#include 
+
+enum {
+   BPFILTER_IPT_SO_SET_REPLACE = 64,
+   BPFILTER_IPT_SO_SET_ADD_COUNTERS = 65,
+   BPFILTER_IPT_SET_MAX,
+};
+
+enum {
+   BPFILTER_IPT_SO_GET_INFO = 64,
+   BPFILTER_IPT_SO_GET_ENTRIES = 65,
+   BPFILTER_IPT_SO_GET_REVISION_MATCH = 66,
+   BPFILTER_IPT_SO_GET_REVISION_TARGET = 67,
+   BPFILTER_IPT_GET_MAX,
+};
+
+enum {
+   BPFILTER_XT_TABLE_MAXNAMELEN = 32,
+};
+
+enum {
+   BPFILTER_NF_DROP = 0,
+   BPFILTER_NF_ACCEPT = 1,
+   BPFILTER_NF_STOLEN = 2,
+   BPFILTER_NF_QUEUE = 3,
+   BPFILTER_NF_REPEAT = 4,
+   BPFILTER_NF_STOP = 5,
+   BPFILTER_NF_MAX_VERDICT = BPFILTER_NF_STOP,
+};
+
+enum {
+   BPFILTER_INET_HOOK_PRE_ROUTING  = 0,
+   BPFILTER_INET_HOOK_LOCAL_IN = 1,
+   BPFILTER_INET_HOOK_FORWARD  = 2,
+   BPFILTER_INET_HOOK_LOCAL_OUT= 3,
+   BPFILTER_INET_HOOK_POST_ROUTING = 4,
+   BPFILTER_INET_HOOK_MAX,
+};
+
+enum {
+   BPFILTER_PROTO_UNSPEC   = 0,
+   BPFILTER_PROTO_INET = 1,
+   BPFILTER_PROTO_IPV4 = 2,
+   BPFILTER_PROTO_ARP  = 3,
+   BPFILTER_PROTO_NETDEV   = 5,
+   BPFILTER_PROTO_BRIDGE   = 7,
+   BPFILTER_PROTO_IPV6 = 10,
+   BPFILTER_PROTO_DECNET   = 12,
+   BPFILTER_PROTO_NUMPROTO,
+};
+
+#ifndef INT_MAX
+#define INT_MAX ((int)(~0U>>1))
+#endif
+#ifndef INT_MIN
+#define INT_MIN (-INT_MAX - 1)
+#endif
+
+enum {
+   BPFILTER_IP_PRI_FIRST   = INT_MIN,
+   BPFILTER_IP_PRI_CONNTRACK_DEFRAG= -400,
+   BPFILTER_IP_PRI_RAW = -300,
+   BPFILTER_IP_PRI_SELINUX_FIRST   = -225,
+   BPFILTER_IP_PRI_CONNTRACK   = -200,
+   BPFILTER_IP_PRI_MANGLE  = -150,
+   BPFILTER_IP_PRI_NAT_DST = -100,
+   BPFILTER_IP_PRI_FILTER  = 0,
+   BPFILTER_IP_PRI_SECURITY= 50,
+   BPFILTER_IP_PRI_NAT_SRC = 100,
+   BPFILTER_IP_PRI_SELINUX_LAST= 225,
+   BPFILTER_IP_PRI_CONNTRACK_HELPER= 300,
+   BPFILTER_IP_PRI_CONNTRACK_CONFIRM   = INT_MAX,
+   BPFILTER_IP_PRI_LAST= INT_MAX,
+};
+
+#define BPFILTER_FUNCTION_MAXNAMELEN   30
+#define BPFILTER_EXTENSION_MAXNAMELEN  29
+#define BPFILTER_TABLE_MAXNAMELEN  32
+
+struct bpfilter_match;
+struct bpfilter_entry_match {
+   union {
+   struct {
+   

[PATCH RFC 0/4] net: add bpfilter

2018-02-16 Thread Daniel Borkmann
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

  # iptables -t filter -A INPUT -i enp2s0 -s 192.168.2.2/32 -j DROP

  # bpftool p
  1: xdp  tag 88896d0ae0f463a6 dev enp2s0  ( <-- offloaded into HW )
loaded_at Feb 15/14:30  uid 0
xlated 184B  jited 640B  memlock 4096B
  # bpftool p d x i 1
   0: (bf) r9 = r1
   1: (79) r2 = *(u64 *)(r9 +0)
   2: (79) r3 = *(u64 *)(r9 +8)
   3: (bf) r1 = r2
   4: (07) r1 += 14
   5: (bd) if r1 <= r3 goto pc+2
   6: (b4) (u32) r0 = (u32) 2
   7: (95) exit
   8: (bf) r1 = r2
   9: (b4) (u32) r5 = (u32) 0
  10: (69) r4 = *(u16 *)(r1 +12)
  11: (55) if r4 != 0x8 goto pc+6
  12: (07) r1 += 34
  13: (2d) if r1 > r3 goto pc+4
  14: (07) r1 += -20
  15: (61) r4 = *(u32 *)(r1 +12)
  16: (55) if r4 != 0x202a8c0 goto pc+1
  17: (04) (u32) r5 += (u32) 1
  18: (55) if r5 != 0x1 goto pc+2
  19: (b4) (u32) r0 = (u32) 1
  20: (95) exit
  21: (b4) (u32) r0 = (u32) 2
  22: (95) exit

Thanks!

Alexei Starovoitov (2):
  modules: allow insmod load regular elf binaries
  bpf: introduce bpfilter commands

Daniel Borkmann (1):
  bpf: rough bpfilter codegen example hack

David S. Miller (1):
  net: initial bpfilter skeleton

 fs/exec.c |  40 -
 include/linux/binfmts.h   |   1 +
 include/linux/bpfilter.h  |  13 ++
 include/linux/umh.h   |   4 +
 include/uapi/linux/bpf.h  |  31 
 include/uapi/linux/bpfilter.h | 200 ++
 kernel/bpf/syscall.c  |  52 ++
 kernel/module.c   |  33 +++-
 kernel/umh.c  |  24 ++-
 net/Kconfig   |   2 +
 net/Makefile  |   1 +
 net/bpfilter/Kconfig  |   7 +
 net/bpfilter/Makefile |   9 +
 net/bpfilter/bpfilter.c   | 106 
 net/bpfilter/bpfilter_mod.h   | 373 ++
 net/bpfilter/ctor.c   |  91 +++
 net/bpfilter/gen.c| 290 
 net/bpfilter/init.c   |  36 
 net/bpfilter/sockopt.c| 236 ++
 net/bpfilter/tables.c |  73 +
 net/bpfilter/targets.c|  51 ++
 net/bpfilter/tgts.c   |  26 +++
 net/ipv4/Makefile |   2 +
 net/ipv4/bpfilter/Makefile|   2 +
 net/ipv4/bpfilter/sockopt.c   |  64 
 net/ipv4/ip_sockglue.c|  17 ++
 26 files changed, 1767 insertions(+), 17 deletions(-)
 create mode 100644 include/linux/bpfilter.h
 create mode 100644 include/uapi/linux/bpfilter.h
 create mode 100644 net/bpfilter/Kconfig
 create mode 100644 net/bpfilter/Makefile
 create mode 100644 net/bpfilter/bpfilter.c
 create mode 100644 net/bpfilter/bpfilter_mod.h
 create mode 100644 net/bpfilter/ctor.c
 create mode 100644 net/bpfilter/gen.c
 create mode 100644 net/bpfilter/init.c
 create mode 100644 net/bpfilter/sockopt.c
 create mode 100644 net/bpfilter/tables.c
 create mode 100644 net/bpfilter/targets.c
 create mode 100644 net/bpfilter/tgts.c
 create mode 100644 net/ipv4/bpfilter/Makefile
 create mode 100644 net/ipv4/bpfilter/sockopt.c

-- 
2.9.5



[PATCH RFC 2/4] bpf: introduce bpfilter commands

2018-02-16 Thread Daniel Borkmann
From: Alexei Starovoitov 

Signed-off-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h | 16 
 kernel/bpf/syscall.c | 41 +
 2 files changed, 57 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index db6bdc3..ea977e9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,8 @@ enum bpf_cmd {
BPF_MAP_GET_FD_BY_ID,
BPF_OBJ_GET_INFO_BY_FD,
BPF_PROG_QUERY,
+   BPFILTER_GET_CMD,
+   BPFILTER_REPLY,
 };
 
 enum bpf_map_type {
@@ -231,6 +233,17 @@ enum bpf_attach_type {
 #define BPF_F_RDONLY   (1U << 3)
 #define BPF_F_WRONLY   (1U << 4)
 
+struct bpfilter_get_cmd {
+   __u32 pid;
+   __u32 cmd;
+   __u64 addr;
+   __u32 len;
+};
+
+struct bpfilter_reply {
+   __u32 status;
+};
+
 union bpf_attr {
struct { /* anonymous struct used by BPF_MAP_CREATE command */
__u32   map_type;   /* one of enum bpf_map_type */
@@ -320,6 +333,9 @@ union bpf_attr {
__aligned_u64   prog_ids;
__u32   prog_cnt;
} query;
+
+   struct bpfilter_get_cmd bpfilter_get_cmd;
+   struct bpfilter_reply bpfilter_reply;
 } __attribute__((aligned(8)));
 
 /* BPF helper function descriptions:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e24aa32..e933bf9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1840,6 +1840,41 @@ static int bpf_obj_get_info_by_fd(const union bpf_attr 
*attr,
return err;
 }
 
+DECLARE_WAIT_QUEUE_HEAD(bpfilter_get_cmd_wq);
+DECLARE_WAIT_QUEUE_HEAD(bpfilter_reply_wq);
+bool bpfilter_get_cmd_ready = false;
+bool bpfilter_reply_ready = false;
+struct bpfilter_get_cmd bpfilter_get_cmd_mbox;
+struct bpfilter_reply bpfilter_reply_mbox;
+
+#define BPFILTER_GET_CMD_LAST_FIELD bpfilter_get_cmd.len
+
+static int bpfilter_get_cmd(const union bpf_attr *attr,
+   union bpf_attr __user *uattr)
+{
+   if (CHECK_ATTR(BPFILTER_GET_CMD))
+   return -EINVAL;
+   wait_event_killable(bpfilter_get_cmd_wq, bpfilter_get_cmd_ready);
+   bpfilter_get_cmd_ready = false;
+   if (copy_to_user(&uattr->bpfilter_get_cmd, &bpfilter_get_cmd_mbox,
+sizeof(bpfilter_get_cmd_mbox)))
+   return -EFAULT;
+   return 0;
+}
+
+#define BPFILTER_REPLY_LAST_FIELD bpfilter_reply.status
+
+static int bpfilter_reply(const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+   if (CHECK_ATTR(BPFILTER_REPLY))
+   return -EINVAL;
+   bpfilter_reply_mbox.status = attr->bpfilter_reply.status;
+   bpfilter_reply_ready = true;
+   wake_up(&bpfilter_reply_wq);
+   return 0;
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
size)
 {
union bpf_attr attr = {};
@@ -1917,6 +1952,12 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
uattr, unsigned int, siz
case BPF_OBJ_GET_INFO_BY_FD:
err = bpf_obj_get_info_by_fd(&attr, uattr);
break;
+   case BPFILTER_GET_CMD:
+   err = bpfilter_get_cmd(&attr, uattr);
+   break;
+   case BPFILTER_REPLY:
+   err = bpfilter_reply(&attr, uattr);
+   break;
default:
err = -EINVAL;
break;
-- 
2.9.5



[PATCH RFC 1/4] modules: allow insmod load regular elf binaries

2018-02-16 Thread Daniel Borkmann
From: Alexei Starovoitov 

Signed-off-by: Alexei Starovoitov 
---
 fs/exec.c   | 40 +++-
 include/linux/binfmts.h |  1 +
 include/linux/umh.h |  4 
 kernel/module.c | 33 -
 kernel/umh.c| 24 +---
 5 files changed, 85 insertions(+), 17 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 7eb8d21..0483c43 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm)
 /*
  * sys_execve() executes a new program.
  */
-static int do_execveat_common(int fd, struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp,
- int flags)
+static int __do_execve_file(int fd, struct filename *filename,
+   struct user_arg_ptr argv,
+   struct user_arg_ptr envp,
+   int flags, struct file *file)
 {
char *pathbuf = NULL;
struct linux_binprm *bprm;
-   struct file *file;
struct files_struct *displaced;
int retval;
 
@@ -1737,7 +1736,8 @@ static int do_execveat_common(int fd, struct filename 
*filename,
check_unsafe_exec(bprm);
current->in_execve = 1;
 
-   file = do_open_execat(fd, filename, flags);
+   if (!file)
+   file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
goto out_unmark;
@@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename 
*filename,
sched_exec();
 
bprm->file = file;
-   if (fd == AT_FDCWD || filename->name[0] == '/') {
+   if (!filename) {
+   bprm->filename = "/dev/null";
+   } else if (fd == AT_FDCWD || filename->name[0] == '/') {
bprm->filename = filename->name;
} else {
if (filename->name[0] == '\0')
@@ -1811,7 +1813,8 @@ static int do_execveat_common(int fd, struct filename 
*filename,
task_numa_free(current);
free_bprm(bprm);
kfree(pathbuf);
-   putname(filename);
+   if (filename)
+   putname(filename);
if (displaced)
put_files_struct(displaced);
return retval;
@@ -1834,10 +1837,29 @@ static int do_execveat_common(int fd, struct filename 
*filename,
if (displaced)
reset_files_struct(displaced);
 out_ret:
-   putname(filename);
+   if (filename)
+   putname(filename);
return retval;
 }
 
+static int do_execveat_common(int fd, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags)
+{
+   struct file *file = NULL;
+
+   return __do_execve_file(fd, filename, argv, envp, flags, file);
+}
+
+int do_execve_file(struct file *file, void *__argv, void *__envp)
+{
+   struct user_arg_ptr argv = { .ptr.native = __argv };
+   struct user_arg_ptr envp = { .ptr.native = __envp };
+
+   return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
+}
+
 int do_execve(struct filename *filename,
const char __user *const __user *__argv,
const char __user *const __user *__envp)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b0abe21..c783a7b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -147,5 +147,6 @@ extern int do_execveat(int, struct filename *,
   const char __user * const __user *,
   const char __user * const __user *,
   int);
+int do_execve_file(struct file *file, void *__argv, void *__envp);
 
 #endif /* _LINUX_BINFMTS_H */
diff --git a/include/linux/umh.h b/include/linux/umh.h
index 244aff6..68ddd4f 100644
--- a/include/linux/umh.h
+++ b/include/linux/umh.h
@@ -22,6 +22,7 @@ struct subprocess_info {
const char *path;
char **argv;
char **envp;
+   struct file *file;
int wait;
int retval;
int (*init)(struct subprocess_info *info, struct cred *new);
@@ -38,6 +39,9 @@ call_usermodehelper_setup(const char *path, char **argv, char **envp,
  int (*init)(struct subprocess_info *info, struct cred *new),
  void (*cleanup)(struct subprocess_info *), void *data);
 
+extern struct subprocess_info *
+call_usermodehelper_setup_file(struct file *file);
+
 extern int
 call_usermodehelper_exec(struct subprocess_info *info, int wait);
 
diff --git a/kernel/module.c b/kernel/module.c
index 1d65b2c..b0febe3 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -325,6 +325,7 @@ struct load_info {
struct {
unsigned int sym, str, mod, vers, info, pcpu;
} index;
+   struct file *file;
 };
 

[PATCH RFC 4/4] bpf: rough bpfilter codegen example hack

2018-02-16 Thread Daniel Borkmann
Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
---
 include/uapi/linux/bpf.h|  31 +++--
 kernel/bpf/syscall.c|  39 +++---
 net/bpfilter/Makefile   |   2 +-
 net/bpfilter/bpfilter.c |  59 +
 net/bpfilter/bpfilter_mod.h | 285 ++-
 net/bpfilter/ctor.c |  57 +
 net/bpfilter/gen.c  | 290 
 net/bpfilter/init.c |  11 +-
 net/bpfilter/sockopt.c  | 137 -
 net/bpfilter/tables.c   |   5 +-
 net/bpfilter/tgts.c |   1 +
 net/ipv4/bpfilter/sockopt.c |  25 +++-
 13 files changed, 835 insertions(+), 109 deletions(-)
 create mode 100644 net/bpfilter/gen.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ea977e9..066d76b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,8 +94,8 @@ enum bpf_cmd {
BPF_MAP_GET_FD_BY_ID,
BPF_OBJ_GET_INFO_BY_FD,
BPF_PROG_QUERY,
-   BPFILTER_GET_CMD,
-   BPFILTER_REPLY,
+   BPF_MBOX_REQUEST,
+   BPF_MBOX_REPLY,
 };
 
 enum bpf_map_type {
@@ -233,14 +233,29 @@ enum bpf_attach_type {
 #define BPF_F_RDONLY   (1U << 3)
 #define BPF_F_WRONLY   (1U << 4)
 
-struct bpfilter_get_cmd {
-   __u32 pid;
-   __u32 cmd;
+enum bpf_mbox_subsys {
+   BPF_MBOX_SUBSYS_BPFILTER,
+#define BPF_MBOX_SUBSYS_BPFILTER   BPF_MBOX_SUBSYS_BPFILTER
+};
+
+enum bpf_mbox_kind {
+   BPF_MBOX_KIND_SET,
+#define BPF_MBOX_KIND_SET  BPF_MBOX_KIND_SET
+   BPF_MBOX_KIND_GET,
+#define BPF_MBOX_KIND_GET  BPF_MBOX_KIND_GET
+};
+
+struct bpf_mbox_request {
__u64 addr;
__u32 len;
+   __u32 subsys;
+   __u32 kind;
+   __u32 cmd;
+   __u32 pid;
 };
 
-struct bpfilter_reply {
+struct bpf_mbox_reply {
+   __u32 subsys;
__u32 status;
 };
 
@@ -334,8 +349,8 @@ union bpf_attr {
__u32   prog_cnt;
} query;
 
-   struct bpfilter_get_cmd bpfilter_get_cmd;
-   struct bpfilter_reply bpfilter_reply;
+   struct bpf_mbox_request mbox_request;
+   struct bpf_mbox_reply   mbox_reply;
 } __attribute__((aligned(8)));
 
 /* BPF helper function descriptions:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e933bf9..2feb438 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1842,36 +1842,47 @@ static int bpf_obj_get_info_by_fd(const union bpf_attr *attr,
 
 DECLARE_WAIT_QUEUE_HEAD(bpfilter_get_cmd_wq);
 DECLARE_WAIT_QUEUE_HEAD(bpfilter_reply_wq);
+
 bool bpfilter_get_cmd_ready = false;
 bool bpfilter_reply_ready = false;
-struct bpfilter_get_cmd bpfilter_get_cmd_mbox;
-struct bpfilter_reply bpfilter_reply_mbox;
 
-#define BPFILTER_GET_CMD_LAST_FIELD bpfilter_get_cmd.len
+struct bpf_mbox_request bpfilter_get_cmd_mbox;
+struct bpf_mbox_reply   bpfilter_reply_mbox;
+
+#define BPF_MBOX_REQUEST_LAST_FIELD   mbox_request.pid
 
-static int bpfilter_get_cmd(const union bpf_attr *attr,
+static int bpf_mbox_request(const union bpf_attr *attr,
union bpf_attr __user *uattr)
 {
-   if (CHECK_ATTR(BPFILTER_GET_CMD))
+   if (CHECK_ATTR(BPF_MBOX_REQUEST))
return -EINVAL;
+   if (attr->mbox_request.subsys != BPF_MBOX_SUBSYS_BPFILTER)
+   return -ENOTSUPP;
+
wait_event_killable(bpfilter_get_cmd_wq, bpfilter_get_cmd_ready);
bpfilter_get_cmd_ready = false;
-   if (copy_to_user(&uattr->bpfilter_get_cmd, &bpfilter_get_cmd_mbox,
+
+   if (copy_to_user(&uattr->mbox_request, &bpfilter_get_cmd_mbox,
 sizeof(bpfilter_get_cmd_mbox)))
return -EFAULT;
return 0;
 }
 
-#define BPFILTER_REPLY_LAST_FIELD bpfilter_reply.status
+#define BPF_MBOX_REPLY_LAST_FIELD  mbox_reply.status
 
-static int bpfilter_reply(const union bpf_attr *attr,
+static int bpf_mbox_reply(const union bpf_attr *attr,
  union bpf_attr __user *uattr)
 {
-   if (CHECK_ATTR(BPFILTER_REPLY))
+   if (CHECK_ATTR(BPF_MBOX_REPLY))
return -EINVAL;
-   bpfilter_reply_mbox.status = attr->bpfilter_reply.status;
+   if (attr->mbox_reply.subsys != BPF_MBOX_SUBSYS_BPFILTER)
+   return -ENOTSUPP;
+
+   bpfilter_reply_mbox.subsys = attr->mbox_reply.subsys;
+   bpfilter_reply_mbox.status = attr->mbox_reply.status;
bpfilter_reply_ready = true;
wake_up(&bpfilter_reply_wq);
+
return 0;
 }
 
@@ -1952,11 +1963,11 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
case BPF_OBJ_GET_INFO_BY_FD:
err = bpf_obj_get_info_by_fd(&attr, uattr);
break;
-   case BPFILTER_GET_CMD:
-   err = bpfilter_get_cmd(&attr, uattr);
+   case BPF_MBOX_REQUEST:
+   err = bpf_mbox_request(&attr, uattr);
break;
-   case BPFILTER_REPLY:
-   err = bpfilter_reply(&attr, uattr);
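
The request/reply flow in the hunks above is a classic mailbox handshake: the kernel side sleeps in wait_event_killable() until userspace posts a reply and wakes the wait queue. Below is a minimal userspace sketch of the same protocol, using a pthread mutex/condvar in place of the kernel wait queue; all names are illustrative, not the kernel API.

```c
#include <pthread.h>
#include <stdbool.h>

/* Illustrative userspace model of the kernel mailbox above: the condvar
 * stands in for the wait queue head, 'ready' for bpfilter_reply_ready. */
struct mbox {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	bool            ready;
	unsigned int    status;
};

/* sender side: fill the mailbox and wake the waiter, like bpf_mbox_reply() */
static void mbox_post(struct mbox *m, unsigned int status)
{
	pthread_mutex_lock(&m->lock);
	m->status = status;
	m->ready  = true;
	pthread_cond_signal(&m->cond);    /* plays the role of wake_up() */
	pthread_mutex_unlock(&m->lock);
}

/* receiver side: sleep until ready, then consume the flag, like
 * wait_event_killable() followed by 'bpfilter_reply_ready = false' */
static unsigned int mbox_wait(struct mbox *m)
{
	unsigned int status;

	pthread_mutex_lock(&m->lock);
	while (!m->ready)
		pthread_cond_wait(&m->cond, &m->lock);
	m->ready = false;
	status = m->status;
	pthread_mutex_unlock(&m->lock);
	return status;
}

/* one round trip; post first so the demo never blocks */
static unsigned int mbox_demo(void)
{
	static struct mbox m = {
		PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, 0
	};

	mbox_post(&m, 42);
	return mbox_wait(&m);
}
```

Clearing the ready flag under the lock mirrors the consume step in bpf_mbox_request() above; without it a second waiter would see a stale reply.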

Re: [PATCH v2] netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'

2017-10-09 Thread Daniel Borkmann

On 10/09/2017 02:27 PM, Shmulik Ladkani wrote:

From: Shmulik Ladkani <shmulik.ladk...@gmail.com>

Commit 2c16d6033264 ("netfilter: xt_bpf: support ebpf") introduced
support for attaching an eBPF object by an fd, with the
'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
IPT_SO_SET_REPLACE call.

However this breaks subsequent iptables calls:

  # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
  # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
  iptables: Invalid argument. Run `dmesg' for more information.

That's because iptables works by loading existing rules using
IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
the replacement set.

However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
(from the initial "iptables -m bpf" invocation) - so when 2nd invocation
occurs, userspace passes a bogus fd number, which leads to
'bpf_mt_check_v1' to fail.

One suggested solution [1] was to hack iptables userspace, to perform an
"entries fixup" immediately after IPT_SO_GET_ENTRIES, by opening a new,
process-local fd for every 'xt_bpf_info_v1' entry seen.

However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested to
deprecate the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.

This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
'.fd' and instead perform an in-kernel lookup for the bpf object given
the provided '.path'.

It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
expected to provide the path of the pinned object.

Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
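
The underlying problem is that file descriptor numbers are process-local and only valid while the descriptor is open, so a '.fd' recorded in one invocation is meaningless when replayed later. A small sketch of that failure mode (single process for brevity; the iptables case spans two separate invocations):

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if a recorded-then-closed fd number is indeed invalid when
 * replayed, mirroring the bogus '.fd' the 2nd iptables call passes in. */
static int stale_fd_demo(void)
{
	int fds[2];
	int stale;

	if (pipe(fds) != 0)
		return -1;
	stale = fds[0];    /* "record" the number, like the dumped '.fd' */
	close(fds[0]);
	close(fds[1]);

	errno = 0;
	return fcntl(stale, F_GETFD) == -1 && errno == EBADF;
}
```

A pinned path, by contrast, is stable across processes and invocations, which is why the fix performs the lookup by '.path' in the kernel.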

References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
 [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2

Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Cc: Willem de Bruijn <will...@google.com>
Reported-by: Rafael Buchbinder <r...@rbk.ms>
Signed-off-by: Shmulik Ladkani <shmulik.ladk...@gmail.com>


Acked-by: Daniel Borkmann <dan...@iogearbox.net>
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'

2017-10-09 Thread Daniel Borkmann

Hi Shmulik,

On 10/09/2017 01:57 PM, Pablo Neira Ayuso wrote:

On Mon, Oct 09, 2017 at 01:18:23PM +0200, Pablo Neira Ayuso wrote:

On Fri, Oct 06, 2017 at 01:40:13PM -0400, Willem de Bruijn wrote:

On Fri, Oct 6, 2017 at 12:02 PM, Shmulik Ladkani  wrote:

From: Shmulik Ladkani 

Commit 2c16d6033264 ("netfilter: xt_bpf: support ebpf") introduced
support for attaching an eBPF object by an fd, with the
'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
IPT_SO_SET_REPLACE call.

However this breaks subsequent iptables calls:

  # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
  # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
  iptables: Invalid argument. Run `dmesg' for more information.

[...]


References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
 [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2

Cc: Pablo Neira Ayuso 
Cc: Willem de Bruijn 
Reported-by: Rafael Buchbinder 
Signed-off-by: Shmulik Ladkani 


Acked-by: Willem de Bruijn 


Applied, thanks.


Hm, I have to hold this back. Compilation breaks here.

net/netfilter/xt_bpf.c: In function ‘__bpf_mt_check_path’:
net/netfilter/xt_bpf.c:59:2: error: implicit declaration of function
‘bpf_obj_get_user’ [-Werror=implicit-function-declaration]
   fd = bpf_obj_get_user(path);
   ^


Yeah, probably best to just add a dummy bpf_obj_get_user()
returning an error when CONFIG_BPF_SYSCALL is disabled.
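
A sketch of that suggested fallback, written as plain userspace C so it compiles standalone; in the kernel this would sit in the header next to the CONFIG_BPF_SYSCALL-guarded declaration, and the exact signature here is illustrative:

```c
#include <errno.h>

/* Hypothetical stand-in for the kernel header change: with the config
 * symbol undefined, callers get an inline dummy returning an error. */
/* #define CONFIG_BPF_SYSCALL */

#ifdef CONFIG_BPF_SYSCALL
extern int bpf_obj_get_user(const char *pathname);
#else
static inline int bpf_obj_get_user(const char *pathname)
{
	(void)pathname;
	return -EOPNOTSUPP;
}
#endif

/* caller akin to __bpf_mt_check_path(): compiles either way */
static int lookup_pinned(const char *path)
{
	return bpf_obj_get_user(path);
}
```

With the dummy in place, xt_bpf builds regardless of CONFIG_BPF_SYSCALL, and the pinned-path mode simply fails cleanly at runtime on kernels without the bpf syscall.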


Re: [PATCH nf-next] netfilter: xt_bpf: support ebpf

2016-12-05 Thread Daniel Borkmann

Hi Willem,

On 12/05/2016 09:28 PM, Willem de Bruijn wrote:

From: Willem de Bruijn 

Add support for attaching an eBPF object by file descriptor.

The iptables binary can be called with a path to an elf object or a
pinned bpf object. Also pass the mode and path to the kernel to be
able to return it later for iptables dump and save.

Signed-off-by: Willem de Bruijn 


Just out of pure curiosity: is the use case for the Android folks wrt
accounting, or is there anything specific that cls_bpf on tc ingress +
egress cannot do already?

Thanks,
Daniel


Re: [PATCH net-next] nfnetlink_queue: enable PID info retrieval

2016-06-09 Thread Daniel Borkmann

On 06/10/2016 12:21 AM, Daniel Borkmann wrote:

On 06/09/2016 11:35 PM, Florian Westphal wrote:

Saeed Mahameed <sae...@mellanox.com> wrote:

index a1bd161..67de200 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -382,6 +382,7 @@ struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
  }

  sock->file = file;
+   file->f_owner.sock_pid  = find_get_pid(task_pid_nr(current));
  file->f_flags = O_RDWR | (flags & O_NONBLOCK);
  file->private_data = sock;
  return file;


This looks like this leaks sock_pid reference...?

(find_get_pid -> get_pid -> atomic_inc() , I don't see a put_pid in the
  patch)

Can't comment further than this since I'm not familiar with vfs; e.g.
I can't say if fown_struct is right place or not, or if this approach
even works when creating process has exited after fork, etc.


Or ... if you xmit the fd via unix domain socket to a different process
and initial owner terminates, which should give you invalid information
then; afaik, this would just increase struct file's refcnt and hand out
an unused fdnum ( get_unused_fd_flags() + fd_install(), etc).
For extending 'struct fown_struct', you probably also need to Cc fs folks.


[ Cc'ing John, Daniel, et al ]

Btw, while I just looked at scm_detach_fds(), I think commits ...

 * 48a87cc26c13 ("net: netprio: fd passed in SCM_RIGHTS datagram not set correctly")
 * d84295067fc7 ("net: net_cls: fd passed in SCM_RIGHTS datagram not set correctly")

... might not be correct, maybe I'm missing something ...? Lets say process A
has a socket fd that it sends via SCM_RIGHTS to process B. Process A was the
one that called sk_alloc() originally. Now in scm_detach_fds() we install a new
fd for process B pointing to the same sock (file's private_data) and above 
commits
update the cached socket cgroup data for net_cls/net_prio to the new process B.
So, if process A for example still sends data over that socket, skbs will then
wrongly match on B's cgroup membership instead of A's, no?
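
For reference, the SCM_RIGHTS mechanics under discussion can be reproduced from userspace: the receiver gets a fresh fd number that refers to the same open file description as the sender's, which is exactly why updating per-socket state at detach time affects the original owner too. A self-contained sketch, with both ends in one process via socketpair:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* pass one fd as SCM_RIGHTS ancillary data */
static int send_fd(int sock, int fd)
{
	char dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union { struct cmsghdr h; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *c = CMSG_FIRSTHDR(&msg);

	c->cmsg_level = SOL_SOCKET;
	c->cmsg_type  = SCM_RIGHTS;
	c->cmsg_len   = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(c), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* receive it: the kernel installs a NEW fd number in the receiver */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union { struct cmsghdr h; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *c;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	c = CMSG_FIRSTHDR(&msg);
	if (!c || c->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(c), sizeof(int));
	return fd;
}

/* returns 1 if data written via the passed fd pops out of the original
 * pipe, i.e. both fds reference the same open file description */
static int scm_demo(void)
{
	int sp[2], p[2], wfd;
	char b = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sp) || pipe(p))
		return -1;
	if (send_fd(sp[0], p[1]))
		return -1;
	wfd = recv_fd(sp[1]);
	if (wfd < 0)
		return -1;
	if (write(wfd, "A", 1) != 1 || read(p[0], &b, 1) != 1)
		return -1;
	return b == 'A';
}
```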

Thanks,
Daniel


Re: [PATCH net-next] nfnetlink_queue: enable PID info retrieval

2016-06-09 Thread Daniel Borkmann

On 06/09/2016 11:35 PM, Florian Westphal wrote:

Saeed Mahameed  wrote:

index a1bd161..67de200 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -382,6 +382,7 @@ struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
}

sock->file = file;
+   file->f_owner.sock_pid  = find_get_pid(task_pid_nr(current));
file->f_flags = O_RDWR | (flags & O_NONBLOCK);
file->private_data = sock;
return file;


This looks like this leaks sock_pid reference...?

(find_get_pid -> get_pid -> atomic_inc() , I don't see a put_pid in the
  patch)

Can't comment further than this since I'm not familiar with vfs; e.g.
I can't say if fown_struct is right place or not, or if this approach
even works when creating process has exited after fork, etc.


Or ... if you xmit the fd via unix domain socket to a different process
and initial owner terminates, which should give you invalid information
then; afaik, this would just increase struct file's refcnt and hand out
an unused fdnum ( get_unused_fd_flags() + fd_install(), etc).
For extending 'struct fown_struct', you probably also need to Cc fs folks.


Re: [PATCH nf-next] netfilter: meta: add PRANDOM support

2016-02-16 Thread Daniel Borkmann

On 02/16/2016 02:19 PM, Florian Westphal wrote:

Daniel Borkmann <dan...@iogearbox.net> wrote:

+   case NFT_META_PRANDOM:
+   if (!prand_inited) {
+   prandom_seed_full_state(&nft_prandom_state);
+   prand_inited = true;
+   }


Should this be: prandom_init_once() ?


Thought about that, but this is slowpath, so I considered
the use of static key magic a bit overkill.

I don't mind; if you think prandom_init_once is preferable, I'll respin.


You'd have the benefit that the prng init would be race free. 
nft_meta_get_init()
could be called in parallel from multiple CPUs, right?


Let me know, thanks!
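
For what it's worth, the race being discussed is the classic double-init on a plain boolean; a once-primitive serializes it. A userspace sketch using pthread_once in place of the kernel's prandom_init_once()/DO_ONCE machinery (the seed function here is a stand-in for prandom_seed_full_state()):

```c
#include <pthread.h>

static pthread_once_t seed_once = PTHREAD_ONCE_INIT;
static int seed_calls;

/* stand-in for prandom_seed_full_state(); counts invocations */
static void seed_prng(void)
{
	seed_calls++;
}

/* race-free init path: pthread_once guarantees exactly one seeding even
 * under concurrent callers, unlike the bare 'if (!prand_inited)' check,
 * where two CPUs can both observe !prand_inited and both seed */
static int init_path(void)
{
	pthread_once(&seed_once, seed_prng);
	return seed_calls;
}
```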



Re: [PATCH nf-next] netfilter: meta: add PRANDOM support

2016-02-16 Thread Daniel Borkmann

Hi Florian,

On 02/16/2016 01:09 PM, Florian Westphal wrote:

Can be used to randomly match a packet, e.g. for statistical traffic
sampling.

See commit 3ad0040573b0c00f8848
("bpf: split state from prandom_u32() and consolidate {c, e}BPF prngs")
for more info why this doesn't use prandom_u32 directly.

Unlike bpf nft_meta can be built as a module, so add an EXPORT_SYMBOL
for prandom_seed_full_state too.

Cc: Daniel Borkmann <dan...@iogearbox.net>
Signed-off-by: Florian Westphal <f...@strlen.de>

[...]

@@ -241,6 +248,7 @@ int nft_meta_get_init(const struct nft_ctx *ctx,
  {
struct nft_meta *priv = nft_expr_priv(expr);
unsigned int len;
+   static bool prand_inited __read_mostly;

priv->key = ntohl(nla_get_be32(tb[NFTA_META_KEY]));
switch (priv->key) {
@@ -277,6 +285,13 @@ int nft_meta_get_init(const struct nft_ctx *ctx,
case NFT_META_OIFNAME:
len = IFNAMSIZ;
break;
+   case NFT_META_PRANDOM:
+   if (!prand_inited) {
+   prandom_seed_full_state(&nft_prandom_state);
+   prand_inited = true;
+   }


Should this be: prandom_init_once() ?


+   len = sizeof(u32);
+   break;
default:
return -EOPNOTSUPP;
}



Thanks,
Daniel