Re: nft/bpf interpreters and spectre2. Was: [PATCH RFC 0/4] net: add bpfilter
[resend as plaintext, apparently mobile gmail will send HTML mails]

On Thu, Feb 22, 2018 at 3:20 AM, Alexei Starovoitov wrote:
> On Wed, Feb 21, 2018 at 01:13:03PM +0100, Florian Westphal wrote:
>>
>> Obvious candidates are: meta, numgen, limit, objref, quota, reject.
>>
>> We should probably also consider removing
>> CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
>> build both too (at least rbtree since that offers interval).
>>
>> For the indirect call issue we can use direct calls from eval loop for
>> some of the more frequently used ones, similar to what we do already
>> for nft_cmp_fast_expr.
>
> nft_cmp_fast_expr and other expressions mentioned above made me thinking...
>
> do we have the same issue with nft interpreter as we had with bpf one?
> bpf interpreter was used as part of spectre2 attack to leak
> information via cache side channel and let VM read hypervisor memory.
> Due to that issue we removed bpf interpreter from the kernel code.
> That's what CONFIG_BPF_JIT_ALWAYS_ON for...
> but we still have nft interpreter in the kernel that can also
> execute arbitrary nft expressions.
>
> Jann's exploit used the following bpf instructions:
[...]
>
> and a gadget to jump into __bpf_prog_run with insn pointing
> to memory controlled by the guest while accessible
> (at different virt address) by the hypervisor.
>
> It seems possible to construct similar sequence of instructions
> out of nft expressions and use gadget that jumps into nft_do_chain().
[...]
> Obviously such exploit is harder to do than bpf based one.
> Do we need to do anything about it ?
> May be it's easier to find gadgets in .text of vmlinux
> instead of messing with interpreters?
>
> Jann,
> can you comment on removing interpreters in general?
> Do we need to worry about having bpf and/or nft interpreter
> in the kernel?

I think that for Spectre V2, the presence of interpreters isn't a big
problem. It simplifies writing attacks a bit, but I don't expect it to
be necessary if an attacker invests some time into finding useful
gadgets.
Re: nft/bpf interpreters and spectre2. Was: [PATCH RFC 0/4] net: add bpfilter
On Thu, Feb 22, 2018 at 12:39:15PM +0100, Pablo Neira Ayuso wrote: > Hi Alexei, > > On Wed, Feb 21, 2018 at 06:20:37PM -0800, Alexei Starovoitov wrote: > > On Wed, Feb 21, 2018 at 01:13:03PM +0100, Florian Westphal wrote: > > > > > > Obvious candidates are: meta, numgen, limit, objref, quota, reject. > > > > > > We should probably also consider removing > > > CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always > > > build both too (at least rbtree since that offers interval). > > > > > > For the indirect call issue we can use direct calls from eval loop for > > > some of the more frequently used ones, similar to what we do already > > > for nft_cmp_fast_expr. > > > > nft_cmp_fast_expr and other expressions mentioned above made me thinking... > > > > do we have the same issue with nft interpreter as we had with bpf one? > > bpf interpreter was used as part of spectre2 attack to leak > > information via cache side channel and let VM read hypervisor memory. > > Due to that issue we removed bpf interpreter from the kernel code. > > That's what CONFIG_BPF_JIT_ALWAYS_ON for... > > but we still have nft interpreter in the kernel that can also > > execute arbitrary nft expressions. > > > > Jann's exploit used the following bpf instructions: > > struct bpf_insn evil_bytecode_instrs[] = { > > // rax = target_byte_addr > > { .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 0, .imm = target_byte_addr > > }, { .imm = target_byte_addr>>32 }, > > We don't place pointers in the nft VM registers, it's basically > illegal to do so, otherwise we would need more sophisticated verifier. > I'm telling this because we don't have a way to point to any arbitrary > address as in 'target_byte_addr' above. these evil_bytecode_instrs never saw bpf verifier either. That's the scary part of that poc. The only requirement for poc to work is to have interpreter in executable part of hypervisor code and speculatively jump into it with arguments pointing to memory controlled by vm. 
All static checks (done by bpf verifier and by nft validation) are bypassed. The only way to defend from such exploit is either remove the interpreter from the kernel or add _run-time_ checks and masks for every memory access (similar to what is done for spectre1 mitigations). In case of bpf it's impractical. In case of nft I suspect so too. I don't yet see how nft can check that skb pointer passed as part of nft_pktinfo is not an actual skb. > > // rdi = timing_leak_array > > { .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 1, .imm = > > host_timing_leak_addr }, { .imm = host_timing_leak_addr>>32 }, > > // rax = *(u8*)rax > > { .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0 }, > > // rax = rax << ... > > { .code = BPF_ALU64 | BPF_LSH | BPF_K, .dst_reg = 0, .imm = 10 - bit_idx }, > > // rax = rax & 0x400 > > { .code = BPF_ALU64 | BPF_AND | BPF_K, .dst_reg = 0, .imm = 0x400 }, > > // rax = rdi + rax > > { .code = BPF_ALU64 | BPF_ADD | BPF_X, .dst_reg = 0, .src_reg = 1 }, > > // *(u8*) (rax + 0x800) > > { .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = > > 0x800 }, > > > > and a gadget to jump into __bpf_prog_run with insn pointing > > to memory controlled by the guest while accessible > > (at different virt address) by the hypervisor. > > > > It seems possible to construct similar sequence of instructions > > out of nft expressions and use gadget that jumps into nft_do_chain(). > > The attacker would need to discover more kernel addresses: > > nft_do_chain, nft_cmp_fast_ops, nft_payload_fast_ops, nft_bitwise_eval, > > nft_lookup_eval, and nft_bitmap_lookup > > to populate nft chains, rules and expressions in guest memory > > comparing to bpf interpreter attack. > > > > Then in nft_do_chain(struct nft_pktinfo *pkt, void *priv) > > pkt needs to point to fake struct sk_buff in guest memory with > > skb->head == target_byte_addr > > We don't have a way to make this point to fake struct sk_buff. yet. 
it's possible, since cpu is speculating and all such pointers controlled by vm can be arbitrary. > > The first nft expression can be nft_payload_fast_eval(). > > If it's properly constructed with > > (nft_payload->based == NFT_PAYLOAD_NETWORK_HEADER, offset == 0, len == 0, > > dreg == 1) > > We can reject len == 0. To be honest, this is not done right now, but > we can place a patch to validate this. Given this is a specialized > networking virtual machine, it retain semantics, so fetching zero > length data from a skbuff makes no sense, hence, we can return EINVAL > via netlink when adding a rule that tries to do this. Adding static check won't help.
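The run-time checks and masks mentioned above can be illustrated with a branch-free index clamp in the style of the Spectre v1 mitigations. This is a minimal user-space sketch, not kernel code: `index_mask` and `load_masked` are hypothetical names, and the arithmetic assumes `index` and `size` are at most LONG_MAX and that right-shifting a negative long is an arithmetic shift (as it is with gcc and clang).

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: all-ones when index < size, zero otherwise,
 * computed without a conditional branch that could be mispredicted. */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
	/* index | (size - 1 - index) has its top bit set iff index >= size
	 * (for values up to LONG_MAX). Invert it, then smear the sign bit. */
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (sizeof(long) * CHAR_BIT - 1));
}

/* Hypothetical bounded load: the architectural check rejects bad indexes,
 * and the mask forces the index to 0 if the check was speculated past. */
static unsigned char load_masked(const unsigned char *buf,
				 unsigned long size, unsigned long index)
{
	if (index >= size)
		return 0;
	index &= index_mask(index, size);
	return buf[index];
}
```

The point of the mask is that it is applied on every access at run time, which is why it still holds when static validation never ran on the data being interpreted.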
Re: nft/bpf interpreters and spectre2. Was: [PATCH RFC 0/4] net: add bpfilter
Hi Alexei, On Wed, Feb 21, 2018 at 06:20:37PM -0800, Alexei Starovoitov wrote: > On Wed, Feb 21, 2018 at 01:13:03PM +0100, Florian Westphal wrote: > > > > Obvious candidates are: meta, numgen, limit, objref, quota, reject. > > > > We should probably also consider removing > > CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always > > build both too (at least rbtree since that offers interval). > > > > For the indirect call issue we can use direct calls from eval loop for > > some of the more frequently used ones, similar to what we do already > > for nft_cmp_fast_expr. > > nft_cmp_fast_expr and other expressions mentioned above made me thinking... > > do we have the same issue with nft interpreter as we had with bpf one? > bpf interpreter was used as part of spectre2 attack to leak > information via cache side channel and let VM read hypervisor memory. > Due to that issue we removed bpf interpreter from the kernel code. > That's what CONFIG_BPF_JIT_ALWAYS_ON for... > but we still have nft interpreter in the kernel that can also > execute arbitrary nft expressions. > > Jann's exploit used the following bpf instructions: > struct bpf_insn evil_bytecode_instrs[] = { > // rax = target_byte_addr > { .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 0, .imm = target_byte_addr }, > { .imm = target_byte_addr>>32 }, We don't place pointers in the nft VM registers, it's basically illegal to do so, otherwise we would need more sophisticated verifier. I'm telling this because we don't have a way to point to any arbitrary address as in 'target_byte_addr' above. > // rdi = timing_leak_array > { .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 1, .imm = > host_timing_leak_addr }, { .imm = host_timing_leak_addr>>32 }, > // rax = *(u8*)rax > { .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0 }, > // rax = rax << ... 
> { .code = BPF_ALU64 | BPF_LSH | BPF_K, .dst_reg = 0, .imm = 10 - bit_idx }, > // rax = rax & 0x400 > { .code = BPF_ALU64 | BPF_AND | BPF_K, .dst_reg = 0, .imm = 0x400 }, > // rax = rdi + rax > { .code = BPF_ALU64 | BPF_ADD | BPF_X, .dst_reg = 0, .src_reg = 1 }, > // *(u8*) (rax + 0x800) > { .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0x800 > }, > > and a gadget to jump into __bpf_prog_run with insn pointing > to memory controlled by the guest while accessible > (at different virt address) by the hypervisor. > > It seems possible to construct similar sequence of instructions > out of nft expressions and use gadget that jumps into nft_do_chain(). > The attacker would need to discover more kernel addresses: > nft_do_chain, nft_cmp_fast_ops, nft_payload_fast_ops, nft_bitwise_eval, > nft_lookup_eval, and nft_bitmap_lookup > to populate nft chains, rules and expressions in guest memory > comparing to bpf interpreter attack. > > Then in nft_do_chain(struct nft_pktinfo *pkt, void *priv) > pkt needs to point to fake struct sk_buff in guest memory with > skb->head == target_byte_addr We don't have a way to make this point to fake struct sk_buff. > The first nft expression can be nft_payload_fast_eval(). > If it's properly constructed with > (nft_payload->based == NFT_PAYLOAD_NETWORK_HEADER, offset == 0, len == 0, > dreg == 1) We can reject len == 0. To be honest, this is not done right now, but we can place a patch to validate this. Given this is a specialized networking virtual machine, it retain semantics, so fetching zero length data from a skbuff makes no sense, hence, we can return EINVAL via netlink when adding a rule that tries to do this. > it will do arbitrary load of > *(u8 *)dest = *(u8 *)ptr; > from target_byte_addr into register 1 of nft state machine > (dest is u32 array of registers in the stack of nft_do_chain) > Second nft expression can be nft_bitwise_eval() to mask particular > bit in register 1. 
> Then nft_cmp_eval() to check whether bit is one or zero and > conditional NFT_BREAK out of first nft expression into second nft rule. > The last conditional nft_immediate_eval() in the first rule will set > register 1 to 0x400 * 8 while the first nft_bitwise_eval() in > the second rule with do r1 &= 0x400 * 8. > So at this point r1 will have either 0x400 * 8 or 0 depending > on value of speculatively loaded bit. > The last expression can be nft_lookup_eval() with > nft_lookup->set->ops->lookup == nft_bitmap_lookup > which will do nft_bitmap->bitmap[idx] where idx = r1 / 8 > The memory used for this last nft_lookup/bitmap expression is > both an instruction and timing_leak_array itself. > If I'm not mistaken, this sequence of nft expression will > speculatively execute very similar logic as in evil_bytecode_instrs[] My impression is that several assumptions above are not correct. > The amount of actual speculative native cpu load/stores/branches is > probably more than executed by bpf interpreter for these evil bytecodes, > but likely well within cpu speculation window of 100+ insns. > > Obviously such exploit is harder to do than bpf
nft/bpf interpreters and spectre2. Was: [PATCH RFC 0/4] net: add bpfilter
On Wed, Feb 21, 2018 at 01:13:03PM +0100, Florian Westphal wrote:
>
> Obvious candidates are: meta, numgen, limit, objref, quota, reject.
>
> We should probably also consider removing
> CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
> build both too (at least rbtree since that offers interval).
>
> For the indirect call issue we can use direct calls from eval loop for
> some of the more frequently used ones, similar to what we do already
> for nft_cmp_fast_expr.

nft_cmp_fast_expr and other expressions mentioned above made me think...

do we have the same issue with nft interpreter as we had with bpf one?
bpf interpreter was used as part of spectre2 attack to leak
information via cache side channel and let VM read hypervisor memory.
Due to that issue we removed bpf interpreter from the kernel code.
That's what CONFIG_BPF_JIT_ALWAYS_ON is for...
but we still have nft interpreter in the kernel that can also
execute arbitrary nft expressions.

Jann's exploit used the following bpf instructions:
struct bpf_insn evil_bytecode_instrs[] = {
  // rax = target_byte_addr
  { .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 0, .imm = target_byte_addr }, { .imm = target_byte_addr>>32 },
  // rdi = timing_leak_array
  { .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 1, .imm = host_timing_leak_addr }, { .imm = host_timing_leak_addr>>32 },
  // rax = *(u8*)rax
  { .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0 },
  // rax = rax << ...
  { .code = BPF_ALU64 | BPF_LSH | BPF_K, .dst_reg = 0, .imm = 10 - bit_idx },
  // rax = rax & 0x400
  { .code = BPF_ALU64 | BPF_AND | BPF_K, .dst_reg = 0, .imm = 0x400 },
  // rax = rdi + rax
  { .code = BPF_ALU64 | BPF_ADD | BPF_X, .dst_reg = 0, .src_reg = 1 },
  // *(u8*) (rax + 0x800)
  { .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0x800 },

and a gadget to jump into __bpf_prog_run with insn pointing
to memory controlled by the guest while accessible
(at different virt address) by the hypervisor.

It seems possible to construct a similar sequence of instructions
out of nft expressions and use gadget that jumps into nft_do_chain().
The attacker would need to discover more kernel addresses:
nft_do_chain, nft_cmp_fast_ops, nft_payload_fast_ops, nft_bitwise_eval,
nft_lookup_eval, and nft_bitmap_lookup
to populate nft chains, rules and expressions in guest memory
compared to bpf interpreter attack.

Then in nft_do_chain(struct nft_pktinfo *pkt, void *priv)
pkt needs to point to fake struct sk_buff in guest memory with
skb->head == target_byte_addr

The first nft expression can be nft_payload_fast_eval().
If it's properly constructed with
(nft_payload->base == NFT_PAYLOAD_NETWORK_HEADER, offset == 0, len == 0,
dreg == 1)
it will do arbitrary load of
*(u8 *)dest = *(u8 *)ptr;
from target_byte_addr into register 1 of nft state machine
(dest is u32 array of registers in the stack of nft_do_chain)
Second nft expression can be nft_bitwise_eval() to mask particular
bit in register 1.
Then nft_cmp_eval() to check whether bit is one or zero and
conditional NFT_BREAK out of first nft expression into second nft rule.
The last conditional nft_immediate_eval() in the first rule will set
register 1 to 0x400 * 8 while the first nft_bitwise_eval() in
the second rule will do r1 &= 0x400 * 8.
So at this point r1 will have either 0x400 * 8 or 0 depending
on value of speculatively loaded bit.
The last expression can be nft_lookup_eval() with
nft_lookup->set->ops->lookup == nft_bitmap_lookup
which will do nft_bitmap->bitmap[idx] where idx = r1 / 8
The memory used for this last nft_lookup/bitmap expression is
both an instruction and timing_leak_array itself.
If I'm not mistaken, this sequence of nft expressions will
speculatively execute very similar logic as in evil_bytecode_instrs[]

The amount of actual speculative native cpu load/stores/branches is
probably more than executed by bpf interpreter for these evil bytecodes,
but likely well within cpu speculation window of 100+ insns.

Obviously such exploit is harder to do than bpf based one.
Do we need to do anything about it?
Maybe it's easier to find gadgets in .text of vmlinux
instead of messing with interpreters?

Jann,
can you comment on removing interpreters in general?
Do we need to worry about having bpf and/or nft interpreter
in the kernel?
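The shift-and-mask arithmetic in the quoted bytecode can be restated in plain C to show which of two offsets the final load touches. This is only a sketch of the arithmetic, with no speculation involved: `probe_offset` is a hypothetical name, and `bit_idx` is assumed to be in the range 0..7.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the BPF_LSH / BPF_AND steps and the
 * .off = 0x800 displacement of the final load: move bit `bit_idx` of
 * `byte` to bit 10, mask it, then add the fixed displacement. */
static uint64_t probe_offset(uint8_t byte, unsigned int bit_idx)
{
	uint64_t rax = byte;

	assert(bit_idx < 8);
	rax <<= 10 - bit_idx;   /* rax = rax << (10 - bit_idx) */
	rax &= 0x400;           /* rax = rax & 0x400 */
	return rax + 0x800;     /* offset relative to timing_leak_array */
}
```

The result is 0x800 when the selected bit is clear and 0xc00 when it is set, so the two cases land on different cache lines of the timing array. The nft variant described in the message produces the same kind of two-way split, with r1 / 8 being either 0x400 or 0.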
Re: [PATCH RFC 0/4] net: add bpfilter
Pablo Neira Ayuso wrote:
> On Tue, Feb 20, 2018 at 05:52:54PM -0800, Alexei Starovoitov wrote:
> > On Tue, Feb 20, 2018 at 11:44:31AM +0100, Pablo Neira Ayuso wrote:
> > >
> > > Don't get me wrong, no software is safe from security issues, but if you
> > > don't abstract your resources in the right way, you have more chance to
> > > have experience more problems.
> >
> > interesting point.
> > The key part of iptables and nft design is heavy use of indirect calls.
> > The execution of single iptable rule is ~3 indirect calls.
> > Quite a lot worse in nft where every expression is an indirect call.
>
> That's right. Netfilter is probably too modular, probably we can
> revisit this to find a better balance, actually Felix Fietkau was
> recently rising concerns on this, specifically in environments with
> limited space to store the kernel image. We'll have a look, thanks for
> remind us about this.

Agree, we have too many config knobs, probably a good idea to turn some
modules into plain .o (like cmp and bitwise).

Obvious candidates are: meta, numgen, limit, objref, quota, reject.

We should probably also consider removing
CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
build both too (at least rbtree since that offers interval).

For the indirect call issue we can use direct calls from eval loop for
some of the more frequently used ones, similar to what we do already
for nft_cmp_fast_expr.

But maybe we don't even have to if we can get help to build a jitter
that takes an nftables netlink table dump and builds jit code from that.
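The direct-call idea can be sketched as an evaluation loop that compares each expression's function pointer against a hot evaluator and calls that one directly, falling back to the indirect call for everything else. All names below (`struct expr`, `eval_cmp`, `eval_other`, `run_chain`) are hypothetical stand-ins for illustration, not the nftables structures.

```c
#include <assert.h>
#include <stddef.h>

struct expr;
typedef int (*eval_fn)(const struct expr *e, int reg);

struct expr {
	eval_fn eval;   /* evaluator for this expression */
	int arg;        /* immediate operand */
};

/* Two toy evaluators: one "hot" comparison and one generic operation. */
static int eval_cmp(const struct expr *e, int reg)   { return reg == e->arg; }
static int eval_other(const struct expr *e, int reg) { return reg + e->arg; }

static int run_chain(const struct expr *exprs, size_t n, int reg)
{
	for (size_t i = 0; i < n; i++) {
		const struct expr *e = &exprs[i];

		if (e->eval == eval_cmp)
			reg = eval_cmp(e, reg);   /* direct call, no retpoline */
		else
			reg = e->eval(e, reg);    /* indirect call */
	}
	return reg;
}
```

The comparison costs a predictable conditional branch per expression, which is the trade being proposed: cheap for the frequently used evaluators, unchanged for the rest.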
Re: [PATCH RFC 0/4] net: add bpfilter
On Tue, Feb 20, 2018 at 05:52:54PM -0800, Alexei Starovoitov wrote:
> On Tue, Feb 20, 2018 at 11:44:31AM +0100, Pablo Neira Ayuso wrote:
> >
> > Don't get me wrong, no software is safe from security issues, but if you
> > don't abstract your resources in the right way, you have more chance to
> > have experience more problems.
>
> interesting point.
> The key part of iptables and nft design is heavy use of indirect calls.
> The execution of single iptable rule is ~3 indirect calls.
> Quite a lot worse in nft where every expression is an indirect call.

That's right. Netfilter is probably too modular, and we can probably
revisit this to find a better balance. Actually, Felix Fietkau was
recently raising concerns about this, specifically in environments with
limited space to store the kernel image. We'll have a look, thanks for
reminding us about this.

[...]
> CPUs will eventually be fixed and IBRS_ALL will become reality.
> Until then the kernel has to deal with the performance issues.

Hopefully, so we can all skip these problems.

Thanks.
Re: [PATCH RFC 0/4] net: add bpfilter
On Tue, Feb 20, 2018 at 11:44:31AM +0100, Pablo Neira Ayuso wrote:
>
> Don't get me wrong, no software is safe from security issues, but if you
> don't abstract your resources in the right way, you have more chance to
> have experience more problems.

interesting point.
The key part of iptables and nft design is heavy use of indirect calls.
The execution of a single iptables rule is ~3 indirect calls.
Quite a lot worse in nft where every expression is an indirect call.
If my math is correct even the simplest nft rule will get to ~10.
It was all fine until spectre2 was discovered and retpoline now adds
20-30 cycles for each indirect call.
To put numbers in perspective the simple
  for(...)
    indirect_call();
loop without retpoline does ~500 M iterations per second on 2+GHz xeon.
  clang -mretpoline
  gcc -mindirect-branch=thunk
  gcc -mindirect-branch=thunk-inline
produce slightly different code with performance of 80-90 M
iterations per second for the above loop.

Looks like iptables/nft did not abstract the resources in the right way
and now experiences more problems.

CPUs will eventually be fixed and IBRS_ALL will become reality.
Until then the kernel has to deal with the performance issues.
bpf and the networking stack will suffer from retpoline as well and
we need to work asap on devirtualization and other ideas.
For xdp a single indirect call we do per packet (to call into bpf prog)
is noticeable and we're experimenting with static_key-like approach
to call bpf program with direct call.
bpf_tail_calls will suffer too and cannot be accelerated as-is.
To solve that we're working on dynamic linking via verifier improvements.
C based bpf programs will use normal indirect calls, but verifier
will replace indirect with direct at pointer update time.
It's not going to be easy, but bpf and stack is fixable,
whereas iptables/nft are going to suffer until fixed CPUs
find their way into servers years from now.
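The loop described above can be reproduced with a small user-space microbenchmark. This is a sketch, not the code that produced the quoted numbers: `bump`, `calls_per_second`, and the use of `clock()` for timing are assumptions, and the `volatile` function pointer is only there to stop the compiler from converting the call into a direct one.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

static uint64_t counter;

/* Trivial callee so the loop cost is dominated by the call itself. */
static void bump(void) { counter++; }

/* volatile: the pointer is reloaded each iteration and called indirectly. */
static void (*volatile indirect_call)(void) = bump;

/* Returns indirect calls per second of CPU time, or 0.0 if the run was
 * too short for clock() to measure. */
static double calls_per_second(uint64_t iterations)
{
	clock_t start = clock();

	for (uint64_t i = 0; i < iterations; i++)
		indirect_call();

	double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
	return secs > 0.0 ? (double)iterations / secs : 0.0;
}
```

Calling `calls_per_second()` with a large iteration count from a `main()`, once in a normal build and once in a build using one of the retpoline options named in the message, would show the gap on a given machine. The absolute figures depend on the CPU and compiler, so none are claimed here.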
Re: [PATCH RFC 0/4] net: add bpfilter
Hi Michal,

On Tue, Feb 20, 2018 at 10:35:41AM +0100, Michal Kubecek wrote:
> On Mon, Feb 19, 2018 at 06:09:39PM +0100, Phil Sutter wrote:
> > What puzzles me about your argumentation is that you seem to propose for
> > the kernel to cover up flaws in userspace. Spinning this concept further
> > would mean that if there would be an old bug in iproute2 we should think
> > of adding a workaround to rtnetlink interface in kernel because
> > containers will keep the old iproute2 binary? Or am I (hopefully) just
> > missing your point?
>
> Actually, that's what we already do. This is from rtnl_dump_ifinfo():
>
>     /* A hack to preserve kernel<->userspace interface.
>      * The correct header is ifinfomsg. It is consistent with rtnl_getlink.
>      * However, before Linux v3.9 the code here assumed rtgenmsg and that's
>      * what iproute2 < v3.9.0 used.
>      * We can detect the old iproute2. Even including the IFLA_EXT_MASK
>      * attribute, its netlink message is shorter than struct ifinfomsg.
>      */

The reason why this is in place (and should be IMHO) is that commit
88c5b5ce5cb57 ("rtnetlink: Call nlmsg_parse() with correct header
length") incompatibly changed uAPI.

I have a different example which reflects what I have in mind, namely
iproute2 commit 33f6dd23a51c4 ("ip fou: pass family attribute as u8")
which basically does:

| -       addattr16(n, 1024, FOU_ATTR_AF, family);
| +       addattr8(n, 1024, FOU_ATTR_AF, family);

If kernel cares about those userspace bugs, shouldn't the better fix be
to make it expect u16 in FOU_ATTR_AF and check whether the high or low
byte contains the expected value?

Cheers, Phil
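The tolerant parsing Phil asks about could look roughly like this: accept either a one-byte or a two-byte attribute payload and, in the two-byte case, look for the family value in either byte. `parse_family_attr`, `is_known_family`, and the `FAM_*` constants are hypothetical names for illustration (2 and 10 are the Linux values of AF_INET and AF_INET6); this is not code from the kernel's fou module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAM_INET  2   /* AF_INET on Linux */
#define FAM_INET6 10  /* AF_INET6 on Linux */

static int is_known_family(uint8_t v)
{
	return v == FAM_INET || v == FAM_INET6;
}

/* Returns the address family, or -1 if the payload is not usable. */
static int parse_family_attr(const uint8_t *payload, size_t len)
{
	if (len == 1)   /* fixed userspace: attribute sent as u8 */
		return is_known_family(payload[0]) ? payload[0] : -1;

	if (len == 2) { /* old userspace: attribute sent as host-endian u16 */
		if (is_known_family(payload[0]) && payload[1] == 0)
			return payload[0];   /* value in first byte (little endian) */
		if (is_known_family(payload[1]) && payload[0] == 0)
			return payload[1];   /* value in second byte (big endian) */
	}
	return -1;
}
```

Whether the kernel should carry this kind of workaround is exactly the question being debated; the sketch only shows that the check itself is small.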
Re: [PATCH RFC 0/4] net: add bpfilter
From: Pablo Neira Ayuso
Date: Tue, 20 Feb 2018 11:44:31 +0100

> * Lack of sufficient abstraction: bpf is not only exposing its own
> software bugs through its interface, but it will also bite the dust
> with CPU bugs due to lack of glue code to hide details behind the
> syscall interface curtain. That will need a kernel upgrade after all to
> fix, so all benefits of adding new programs. We've even seen claims on
> performance being more important than security in this mailing list.
> Don't get me wrong, no software is safe from security issues, but if you
> don't abstract your resources in the right way, you have more chance to
> have experience more problems.

I find it surprising that the person who didn't even know that
generating classical BPF was not appropriate in his patches is suddenly
a complete expert on eBPF and all of its shortcomings.

Pablo, I am sincerely very disappointed in you, and if you continue to
attack eBPF in such an ignorant way going forward we will have a very
hard time taking you seriously at all.

Thank you.
Re: [PATCH RFC 0/4] net: add bpfilter
On 02/20/2018 11:44 AM, Pablo Neira Ayuso wrote: > Hi David! > > On Mon, Feb 19, 2018 at 12:22:26PM -0500, David Miller wrote: > [...] >> Netfilter's chronic performance differential is why a lot of mindshare >> was lost to userspace networking technologies. > > Claiming that Netfilter is the reason for the massive adoption of > userspace networking isn't a fair statement at all. > > Let's talk about performance if this is what you want: > > * Our benchmarks here are delivering ~x9.5 performance boost for IPv4 > load balancing from netfilter ingress. > > * ~x2 faster than iptables prerouting when dropping packets at very > early stage in the network datapath - dos attack scenario - again from > the ingress hook. > > * The new flowtable infrastructure that will show up in 4.16 provides > a faster forwarding path, measuring ~x2 faster forwarding here, _by > simply adding one single rule to your FORWARD chain_. And that's > just the initial implementation that got merged upstream, we have > room to fly even faster. > > And that's just the beginning, we have more ongoing work, incrementally > based on top of what we have, to provide even faster datapath paths with > very simple configurations. > > Note that those numbers above are very similar numbers to what we have > seen in bpf. Well, to be honest, we're just slightly behind bpf, since > benchmarks I have seen on loading balancing IPv4 is x10 from XDP, > dropping packets also slightly more than x2, which is actually happening > way earlier than ingress, naturally dropping earlier gives us better > numbers. > > But it's not all about performance... let's have a look at the "iron > triangle"... > > We keep usability in our radar, that's paramount for us. Netfilter is > probably so much widely more adopted than tc because of this. The kind Right, in terms of performance the above is what tc ingress used to do already long ago after spinlock removal could be lifted, which was an important step on that direction. 
In terms of usability, sure, it's always a 'fun' topic on that matter for a number of classifier / actions mostly from the older days. I think there it has improved a bit over time, but at least speaking of things like cls_bpf, it's trivial to attach an object somewhere via tc cmdline. > of problems that big Silicon datacenters have to face are simply > different to the millions of devices running Linux outthere, there are > plenty of smart devops outthere that sacrify the little performance loss > at the cost of keeping it easy to configure and maintain things. > > If we want to talk about problems... > > Every project has its own subset of problems. In that sense, anyone that > has spent time playing with the bpf infrastructure is very much aware of > all of its usability problems: > > * You have to disable optimizations in llvm, otherwise the verifier > gets confused too smart compiler optimizations and rejects the code. That is actually a false claim, which makes me think that you didn't even give this a try at all before stating the above. Funny enough, for a very long period of time in LLVM's BPF back end when you used other optimization levels than the -O2, clang would bark with an internal error, for example: $ clang-3.9 -target bpf -O0 -c foo.c -o /tmp/foo.o fatal error: error in backend: Cannot select: 0x5633ae698280: ch,glue = BPFISD::CALL 0x5633ae698210, 0x5633ae697e90, Register:i64 %R1, Register:i64 %R2, Register:i64 %R3, 0x5633ae698210:1 0x5633ae697e90: i64,ch = load 0x5633ae6955e0, 0x5633ae694fc0, undef:i64 0x5633ae694fc0: i64 = BPFISD::Wrapper TargetGlobalAddress:i64 0 [...] Whereas -O2 *is* the general recommendation for everyone to use: $ clang-3.9 -target bpf -O2 -c foo.c -o /tmp/foo.o $ This is fixed in later versions, e.g. in clang-7.0 such back end error is gone anyway fwiw. But in any case, we're running complex programs with -O2 optimization levels for several years now just fine. 
Yes, given we do push BPF to the limits we had some corner cases where the verifier had to be adjusted, but overall the number of cases reduced over time, which is also a natural progression when people use it in various advanced ways. In fact, it's a much better choice to use clang with -O2 here since simply the majority of people use it that way. And if you consume it via higher level front ends e.g. bcc, ply, bpftrace to name a few from tracing side, then you don't need to care at all about this. (But in addition to that, there's also continuous effort on LLVM side to optimize BPF code generation in various ways.) > * Very hard to debug the reason why the verifier is rejecting apparently > valid code. That results in people playing strange "moving code around > up and down". Please show me your programs and I'm happy to help you out. :-) Yes, in the earlier days, I would consider it might have been hard; during the course of the last few years, the verifier and LLVM back end hav
Re: [PATCH RFC 0/4] net: add bpfilter
Hi David,

On Mon, Feb 19, 2018 at 12:15:37PM -0500, David Miller wrote:
> From: Phil Sutter
> Date: Mon, 19 Feb 2018 18:09:39 +0100
>
> > What puzzles me about your argumentation is that you seem to propose for
> > the kernel to cover up flaws in userspace. Spinning this concept further
> > would mean that if there would be an old bug in iproute2 we should think
> > of adding a workaround to rtnetlink interface in kernel because
> > containers will keep the old iproute2 binary? Or am I (hopefully) just
> > missing your point?
>
> I'll answer this with a question. I tried to remove UFO entirely from
> the kernel, did you see how that went? :)

I didn't follow back then, but found mails about KVM live migration
breakage when moving to a kernel without UFO. But isn't that a problem
with how virtio_net optimizes things? Florian recently told me how
iptables CHECKSUM target was mainly introduced to overcome a different
problem in the same area.

So all this is kernel covering up for kernel problems. My question was
about covering up for userspace bugs in kernelspace. If you think that
is preferable over fixing userspace, I have to take that into
consideration when dealing with userspace issues.

Cheers, Phil
Re: [PATCH RFC 0/4] net: add bpfilter
Hi David!

On Mon, Feb 19, 2018 at 12:22:26PM -0500, David Miller wrote:
[...]
> Netfilter's chronic performance differential is why a lot of mindshare
> was lost to userspace networking technologies.

Claiming that Netfilter is the reason for the massive adoption of
userspace networking isn't a fair statement at all.

Let's talk about performance if this is what you want:

* Our benchmarks here are delivering ~x9.5 performance boost for IPv4
  load balancing from netfilter ingress.

* ~x2 faster than iptables prerouting when dropping packets at very
  early stage in the network datapath - dos attack scenario - again from
  the ingress hook.

* The new flowtable infrastructure that will show up in 4.16 provides
  a faster forwarding path, measuring ~x2 faster forwarding here, _by
  simply adding one single rule to your FORWARD chain_. And that's
  just the initial implementation that got merged upstream, we have
  room to fly even faster.

And that's just the beginning, we have more ongoing work, incrementally
based on top of what we have, to provide even faster datapaths with
very simple configurations.

Note that those numbers above are very similar to what we have
seen in bpf. Well, to be honest, we're just slightly behind bpf, since
the benchmarks I have seen on load balancing IPv4 show x10 from XDP,
dropping packets also slightly more than x2, which is actually happening
way earlier than ingress; naturally dropping earlier gives better
numbers.

But it's not all about performance... let's have a look at the "iron
triangle"...

We keep usability on our radar, that's paramount for us. Netfilter is
probably so much more widely adopted than tc because of this. The kind
of problems that big Silicon datacenters have to face are simply
different to the millions of devices running Linux out there; there are
plenty of smart devops out there who accept a little performance loss
in exchange for keeping things easy to configure and maintain.

If we want to talk about problems...

Every project has its own subset of problems. In that sense, anyone that
has spent time playing with the bpf infrastructure is very much aware of
all of its usability problems:

* You have to disable optimizations in llvm, otherwise the verifier
  gets confused by too-smart compiler optimizations and rejects the code.

* Very hard to debug the reason why the verifier is rejecting apparently
  valid code. That results in people playing strange "moving code around
  up and down".

* Lack of sufficient abstraction: bpf is not only exposing its own
  software bugs through its interface, but it will also bite the dust
  with CPU bugs due to lack of glue code to hide details behind the
  syscall interface curtain. That will need a kernel upgrade after all to
  fix, so all benefits of adding new programs. We've even seen claims on
  performance being more important than security in this mailing list.
  Don't get me wrong, no software is safe from security issues, but if you
  don't abstract your resources in the right way, you have more chance to
  experience more problems.

Just to mention a few of them.

So, please, let's each focus on our own work. Let me remind you of your
wise words - I think just one year ago in another of these episodes of
the bpf vs. netfilter: "We're all working to achieve the same goals",
even if we're working on competing projects inside Linux.

Thanks!
Re: [PATCH RFC 0/4] net: add bpfilter
On Mon, Feb 19, 2018 at 06:09:39PM +0100, Phil Sutter wrote:
> What puzzles me about your argumentation is that you seem to propose for
> the kernel to cover up flaws in userspace. Spinning this concept further
> would mean that if there would be an old bug in iproute2 we should think
> of adding a workaround to rtnetlink interface in kernel because
> containers will keep the old iproute2 binary? Or am I (hopefully) just
> missing your point?

Actually, that's what we already do. This is from rtnl_dump_ifinfo():

	/* A hack to preserve kernel<->userspace interface.
	 * The correct header is ifinfomsg. It is consistent with rtnl_getlink.
	 * However, before Linux v3.9 the code here assumed rtgenmsg and that's
	 * what iproute2 < v3.9.0 used.
	 * We can detect the old iproute2. Even including the IFLA_EXT_MASK
	 * attribute, its netlink message is shorter than struct ifinfomsg.
	 */

(There are, in fact, even current tools using rtgenmsg but that's another story.)

Michal Kubecek
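
The length check described in that kernel comment can be modeled in a few lines. This is only an illustrative sketch in Python, not the kernel's code: the function name `request_header_len` is made up for the example, and the struct layouts are written out from the Linux UAPI definitions of `struct rtgenmsg` (one byte) and `struct ifinfomsg` (16 bytes).

```python
import struct

# Header sizes, written out from the UAPI struct definitions:
#   struct rtgenmsg  { unsigned char rtgen_family; }
#   struct ifinfomsg { u8 family; u8 pad; u16 type; s32 index; u32 flags; u32 change; }
RTGENMSG_LEN = struct.calcsize("=B")         # 1 byte
IFINFOMSG_LEN = struct.calcsize("=BBHiII")   # 16 bytes


def request_header_len(payload_len: int) -> int:
    """Guess which header the sender used, from the payload length alone.

    Hypothetical helper: a payload too short to hold an ifinfomsg must
    have come from a tool that sent the one-byte rtgenmsg header.
    """
    if payload_len < IFINFOMSG_LEN:
        return RTGENMSG_LEN
    return IFINFOMSG_LEN


# Old-style request: rtgenmsg padded to 4 bytes plus one 8-byte
# IFLA_EXT_MASK attribute is 12 bytes, still shorter than ifinfomsg.
old_style_payload = 4 + 8
print(request_header_len(old_style_payload))   # rtgenmsg size
print(request_header_len(IFINFOMSG_LEN + 8))   # ifinfomsg size
```

The point of the example is that the workaround costs one comparison in the kernel, which is why it could be kept for so long without anyone noticing.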
Re: [PATCH RFC 0/4] net: add bpfilter
> I see several possible areas of contention:
>
> 1) If you aim for a non-feature-complete support of iptables rules, it
>    will create confusion to the users.

Right, you need full feature parity to avoid ending up having to maintain two implementations.

It seems uncontroversial that BPF can be very powerful if run at iptables hooks. For performance, but also versatility. The Android folks are converting one out-of-tree module to BPF. There is probably a lot more such business logic out there that is not suitable for inclusion in mainline as an xt match/target, and that needs more access than xt_bpf can provide.

If a new first-class citizen BPF infra can do this and back the legacy interface, too, that would save on maintenance. There is a steady stream of fixes to iptables, e.g., from syzkaller vulnerability reports. Just keeping the old implementation around as a dead letter is not a safe deprecation strategy.

To bootstrap bpfilter, in the short term a reasonable set of iptables targets and matches can perhaps be ported to BPF external functions with some simple glue code.

> To me, this looks like some kind of legacy backwards compatibility
> mechanism that one would find in proprietary operating systems, but not
> in Linux. iptables, libiptc etc. are all free software. The source
> code can be edited, and you could just as well have a new version of
> iptables and/or libiptc which would pass the ruleset in userspace to
> your compiler, which would then insert the resulting eBPF program.
>
> Why add quite comprehensive kernel infrastructure? What's the motivation
> here?

The ABI deprecation point has been discussed quite a bit. If it is infeasible to just drop the old interface, then an upcall mechanism does seem the most practical approach to dynamically generating this code.

FWIW, as BPF is being used in more places, other locations besides iptables could make use of this.
> Could you please clarify why the 'filter' table INPUT chain was used if
> you're using XDP? AFAICT they have completely different semantics.
>
> There is a well-conceived and generally understood notion of where
> exactly the filter/INPUT table processing happens. And that's not as
> early as in the NIC, but it's much later in the processing of the
> packet.
>
> I believe _if_ one wants to use the approach of "hiding" eBPF behind
> iptables, then either
>
> a) the eBPF programs must be executed at the exact same points in the
>    stack as the existing hooks of the built-in chains of the
>    filter/nat/mangle/raw tables, or
>
> b) you must introduce new 'tables', like an 'xdp' table which then has
>    the notion of processing very early in processing, way before the
>    normal filter table INPUT processing happens.

Agreed. One of the larger issues in the Android qtaguid conversion was the state surrounding the skb at the time of processing. This example primarily depended on having skb->sk set. Whether that is available at tc depends on early decap, and even when set the sk might prove different from the final one in the socket layer in edge cases. Just one example of how moving the call site can be very fragile wrt state.

Another issue wrt moving around is availability of external functions at different layers. XDP has access to far fewer than TC. For iptables, I would imagine that you either want parity with TC or even a new independent type. Parity would be useful also to expose some xt_match functionality at the TC layer that is currently missing there.

> My main points are:
>
> 1) What is the goal of this?

My high bit feedback: for cases like qtaguid, it is very useful to be able to execute BPF as a drop-in at existing iptables locations, as is having various match and target functionality available from BPF.

Maintaining the legacy ABI is basically dictated. If this can be achieved while optimizing the runtime path and reducing maintenance, that is very appealing.
Re: [PATCH RFC 0/4] net: add bpfilter
Hi David,

On Mon, 19 Feb 2018, Florian Westphal wrote:
> David Miller wrote:
> >
> > Florian, first of all, the whole "change the iptables binary" idea is
> > a non-starter. For the many reasons I have described in the various
> > postings I have made today.
> >
> > It is entirely impractical.

You stressed several times that container images and virtualization installations don't change - and that's an exaggeration. Those are updated as well, and not only because security updates must be rolled out, but because new versions of software are requested.

You mentioned that the hosting part can upgrade the kernel - it means that enabling NFTABLES is also a non-issue when the new eBPF functionality is switched on, if that was missing.

> You suggest:
>
> iptables -> setsockopt -> umh (xtables -> ebpf) -> kernel
>
> How is this different from
>
> iptables -> setsockopt -> umh (Xtables -> nftables) -> kernel
>
> ?
> EBPF can be placed within nftables either userspace or kernel,
> there is nothing that prevents this.

So why is the second scenario suggested by Florian not possible, or why must it be avoided? It not only could keep the unmodified iptables in the container (if that's a must for some reason), but it would also make it possible to replace it later at any time with iptables-compat/nftables.

Best regards,
Jozsef
-
E-mail  : kad...@blackhole.kfki.hu, kadlecsik.joz...@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary
Re: [PATCH RFC 0/4] net: add bpfilter
David Miller wrote:
> From: Phil Sutter
> Date: Mon, 19 Feb 2018 18:14:11 +0100
>
> > OK, so reading between the lines you're saying that nftables project
> > has failed to provide an adequate successor to iptables?
>
> Whilst it is great that the atomic table update problem was solved, I
> think the emphasis on flexibility often at the expense of performance
> was a bad move.

That's not true, IMO.

One idea previously discussed was to add a 'freeze' option to our nftables syntax.

Essentially what would happen is that further updates to the table become impossible, with the exception of named sets (which can be changed independently, similar to ebpf maps I suppose).

As further updates to the table are then no longer allowed, this would make it possible to e.g. jit all rules into a single program.

The table could still be removed (and recreated) of course, so it's not impossible to make changes, but no longer at the rule level.

> Netfilter's chronic performance differential is why a lot of mindshare
> was lost to userspace networking technologies.

I think this is an unfair statement and also not true. If you refer to the linear ruleset evaluation of iptables, this is what ipset was added for.

Yes, it's a band-aid. But again, that problem comes from the UAPI format/limitation of only having one source or destination address per rule, a limitation not present in nftables.

Another reason why iptables is a bit more costly than needed (although it IS rather fast, given there are no spinlocks in the main eval loop) is the rule counter updates, which were built into the design all those years ago.

Again, a problem solved in nftables by making the counters optional.

If you want to speed up the forward path with XDP -- fine. But AFAIU it's still possible with XDP to have packets being sent to the full stack, right?

If so, it would be possible to even combine nftables with XDP, f.e. by allowing an ebpf program running on the host CPU to query netfilter conntrack.
No Entry -> push to normal path
Entry -> check 'fastpath' flag (which would be in nf_conn struct).
Not set -> also normal path. Otherwise continue XDP, stack bypass.

nftables would have a rule similar to this:

nft add rule inet forward ct state established ct label set fastpath

to switch such a conntrack entry to xdp mode. This decision can then be combined with nftables infra, for example 'fastpath for tcp flows that saw more than 1mbit of data in either direction' or the like.

Yes, this needs ebpf support for conntrack and NAT transformations, and it does beg the question of how to handle the details, e.g. conntrack timeouts.

Don't see any unsolvable issues with this though.

It also has similarities with the 'flow offload' proposal, i.e. we could perhaps even reuse what we already have to provide flow offload in software using ebpf/XDP as the offload backend.
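
The three-way decision sketched in this mail (no entry, entry without the flag, entry with the flag) can be written down as a small model. This is only a sketch in Python for illustration: `ConnEntry`, `xdp_verdict` and the dictionary standing in for the conntrack table are all made up for the example, and nothing here corresponds to an existing kernel or eBPF helper.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical flow key: (src addr, dst addr, src port, dst port, protocol)
FlowKey = Tuple[str, str, int, int, str]


@dataclass
class ConnEntry:
    # Stand-in for the 'fastpath' flag the mail suggests keeping in nf_conn.
    fastpath: bool = False


def xdp_verdict(conntrack: Dict[FlowKey, ConnEntry], key: FlowKey) -> str:
    """Return where a packet goes under the proposed scheme."""
    entry = conntrack.get(key)
    if entry is None:
        return "normal_path"    # no conntrack entry: full stack
    if not entry.fastpath:
        return "normal_path"    # entry exists, flag not set: full stack
    return "xdp_fastpath"       # flag set: stay in XDP, bypass the stack


table: Dict[FlowKey, ConnEntry] = {
    ("10.0.0.1", "10.0.0.2", 40000, 443, "tcp"): ConnEntry(fastpath=True),
    ("10.0.0.3", "10.0.0.2", 40001, 80, "tcp"): ConnEntry(),
}

for flow in list(table) + [("10.0.0.9", "10.0.0.2", 1234, 22, "tcp")]:
    print(flow, "->", xdp_verdict(table, flow))
```

In this model the ruleset only decides when the flag gets set; the per-packet work on the fast path is a single lookup, which is the property the proposal relies on.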
Re: [PATCH RFC 0/4] net: add bpfilter
Hi David,

On Mon, Feb 19, 2018 at 01:41:29PM -0500, David Miller wrote:
> From: Phil Sutter
> Date: Mon, 19 Feb 2018 19:05:51 +0100
>
> > On Mon, Feb 19, 2018 at 12:22:26PM -0500, David Miller wrote:
> >> From: Phil Sutter
> >> Date: Mon, 19 Feb 2018 18:14:11 +0100
> >>
> >> > OK, so reading between the lines you're saying that nftables project
> >> > has failed to provide an adequate successor to iptables?
> >>
> >> Whilst it is great that the atomic table update problem was solved, I
> >> think the emphasis on flexibility often at the expense of performance
> >> was a bad move.
> >
> > I don't see a lack of performance in nftables when being compared to
> > iptables (as we have now). From my point of view, it's quite the
> > contrary: nftables did a great job in picking up iptables performance
> > afterthoughts (e.g. ipset) and leveraging that to the max(TM) (verdict
> > maps, concatenated set entries). Assuming the virtual machine design
> > principle isn't just marketing but sets the course for JIT ruleset
> > optimizations, there's some margin as well.
> >
> > So from my perspective, one should say nftables increased flexibility
> > without sacrificing performance.
>
> I did not say nftables adjusted performance one way or another. It kept
> it on the same order of magnitude. And this is a design decision.

Oh, seems I missed your point then. What subject did you have in mind when you wrote "emphasis on flexibility often at the expense of performance"? I thought you were talking about nftables.

> > Yes, even with my limited experience I noticed that there is quite some
> > demand for even faster packet processing in Linux, mostly for rather
> > custom scenarios like forwarding into containers/VMs. Though my point
> > was about general purpose firewalling abilities in Linux, say people
> > securing their desktop or maintaining networks with less demands on
> > performance.
>
> I've always stated that low power, low end, systems are just as good a
> place for high performance filtering as high end ones.

Do you think these systems are likely to receive a NIC (or some sort of co-processor) to which eBPF can be offloaded? Maybe I miss the point again, but this is the only argument for bpfilter over nftables (and that only if one ignores the option to implement an eBPF backend for the nftables VM).

OK, maybe this clarifies once I know what you had in mind when you wrote that reply.

Cheers, Phil
Re: [PATCH RFC 0/4] net: add bpfilter
From: Harald Welte
Date: Mon, 19 Feb 2018 19:37:30 +0100

> I was speaking of actual *users* as in individuals running their own
> systems, companies running their own servers/datacenter. The fact that
> some ISP (or its supplier) decides that one of my IP packets is routed
> via a smartnic with XDP offloading somewhere is great, but still doesn't
> turn me into a "user" of that technology. Not in my line of thinking,
> at least.

I am sorry that our opinions differ.

I must consider all users of Linux, both direct and indirect, to determine impact and where resources and efforts should be allocated.

>> And by and large, for system tracing and analysis eBPF is basically
>> a hard requirement for people doing anything serious these days.
>
> That's great, but misses the point. I was referring to usage in the
> context of the kernel network stack. Sorry for not being explicit
> enough.

And that misses the point entirely.

eBPF is more than just networking: this technology is not networking specific but a kernel-wide one that is being adopted in every nook and cranny of the kernel.

> Sure, one data center / hosting / "cloud" provider can quickly roll out
> a change in their network. But I'm referring to significant,
> (Linux-)industry-wide adoption.

Hehe, I guess whatever definition works for the position you are trying to take. :-)
Re: [PATCH RFC 0/4] net: add bpfilter
From: Arturo Borrero Gonzalez
Date: Mon, 19 Feb 2018 19:06:12 +0100

> Yes, probably major datacenters (google? facebook?, amazon?) they
> don't even care about what Debian is doing, since they are crafting
> their own distro anyway. But there are *a lot* of other people that
> do care about these migration plans.

"Lots" is a big word that gets thrown around quite carelessly.

What do you imagine is the order of magnitude of cloud and big datacenter server system deployments vs. individual servers and whatnot?

I have to take into consideration what will really have the largest impact on the largest number of people, and I am pretty sure I know where that lies.
Re: [PATCH RFC 0/4] net: add bpfilter
From: Phil Sutter
Date: Mon, 19 Feb 2018 19:05:51 +0100

> Hi David,
>
> On Mon, Feb 19, 2018 at 12:22:26PM -0500, David Miller wrote:
>> From: Phil Sutter
>> Date: Mon, 19 Feb 2018 18:14:11 +0100
>>
>> > OK, so reading between the lines you're saying that nftables project
>> > has failed to provide an adequate successor to iptables?
>>
>> Whilst it is great that the atomic table update problem was solved, I
>> think the emphasis on flexibility often at the expense of performance
>> was a bad move.
>
> I don't see a lack of performance in nftables when being compared to
> iptables (as we have now). From my point of view, it's quite the
> contrary: nftables did a great job in picking up iptables performance
> afterthoughts (e.g. ipset) and leveraging that to the max(TM) (verdict
> maps, concatenated set entries). Assuming the virtual machine design
> principle isn't just marketing but sets the course for JIT ruleset
> optimizations, there's some margin as well.
>
> So from my perspective, one should say nftables increased flexibility
> without sacrificing performance.

I did not say nftables adjusted performance one way or another. It kept it on the same order of magnitude. And this is a design decision.

> Yes, even with my limited experience I noticed that there is quite some
> demand for even faster packet processing in Linux, mostly for rather
> custom scenarios like forwarding into containers/VMs. Though my point
> was about general purpose firewalling abilities in Linux, say people
> securing their desktop or maintaining networks with less demands on
> performance.

I've always stated that low power, low end systems are just as good a place for high performance filtering as high end ones.
Re: [PATCH RFC 0/4] net: add bpfilter
Hi David,

On Mon, Feb 19, 2018 at 12:29:08PM -0500, David Miller wrote:
> People with an Android phone in their pocket are using iptables, and
> the overhead and performance of those rules really does matter. It
> determines how long your battery life is, etc.

I am not the Android expert. However, I just dumped the ruleset on my Galaxy Tab S2 (Android 7.1.2 / LineageOS), and it was a whopping 91 rules across all tables. The longest chain iteration I could spot was 24 rules.

That's not the kind of ruleset where I would expect performance worries. And if there were any, nftables has been around for quite some time and would be much faster.

Sure, that was just one tablet, but I wonder how many Android packet filter performance issues there are. It would be interesting to hear about those (and on whether they benchmarked against nftables).

> > I can just as well ask how many millions of users / devices are
> > already using eBPF or XDP?
>
> Every time someone connects to a major provider, they are using it.

I was speaking of actual *users* as in individuals running their own systems, companies running their own servers/datacenter. The fact that some ISP (or its supplier) decides that one of my IP packets is routed via a smartnic with XDP offloading somewhere is great, but still doesn't turn me into a "user" of that technology. Not in my line of thinking, at least.

> And by and large, for system tracing and analysis eBPF is basically
> a hard requirement for people doing anything serious these days.

That's great, but misses the point. I was referring to usage in the context of the kernel network stack. Sorry for not being explicit enough.

Also, the entire point was about "new technologies need time to be adopted widely". Doesn't matter which new kernel feature that is.

Sure, one data center / hosting / "cloud" provider can quickly roll out a change in their network. But I'm referring to significant, (Linux-)industry-wide adoption.
That would first require major distributions to include/enable/support the feature, and then people actually building their systems/products/software on top of those.

> Please see the wonderful work by Brendan Gregg and others which has
> basically made the GPL'ing of DTrace by Oracle entirely irrelevant and
> our Linux's tracing infrastructure has become much more powerful and
> capable thanks to eBPF.

Agreed.

--
- Harald Welte           http://laforge.gnumonks.org/
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)
Re: [PATCH RFC 0/4] net: add bpfilter
On 19 February 2018 at 16:36, David Miller wrote:
>
> In my opinion, any resistance to integration with eBPF and XDP will
> lead to even less adoption of netfilter as a technology.
>
> Therefore my plan is to move everything to be integrated around these
> important core technologies. For the purposes of integration, code
> coverage, performance, and the ability to juxtapose different bits of
> eBPF code into larger optimized code streams that can also be
> offloaded into hardware.

Thanks for sharing your plans. I'll share mine.

Debian already recommends using nftables rather than iptables.

Probably in the next release cycle we (Debian) will give even more prominence to nftables by linking iptables to iptables-compat, as an opt-in for users, so we don't break systems.

By the next-next release cycle (4+ years or so?) we will probably have enough confidence in the compat/translation tools that Debian could fully wipe the old iptables binary and use just the nftables framework. Same for ip6tables, arptables, ebtables.

Does this sound reasonable to you?

Yes, probably major datacenters (google? facebook? amazon?) don't even care about what Debian is doing, since they are crafting their own distro anyway. But there are *a lot* of other people that do care about these migration plans.
Re: [PATCH RFC 0/4] net: add bpfilter
Hi David,

On Mon, Feb 19, 2018 at 12:22:26PM -0500, David Miller wrote:
> From: Phil Sutter
> Date: Mon, 19 Feb 2018 18:14:11 +0100
>
> > OK, so reading between the lines you're saying that nftables project
> > has failed to provide an adequate successor to iptables?
>
> Whilst it is great that the atomic table update problem was solved, I
> think the emphasis on flexibility often at the expense of performance
> was a bad move.

I don't see a lack of performance in nftables when being compared to iptables (as we have now). From my point of view, it's quite the contrary: nftables did a great job in picking up iptables performance afterthoughts (e.g. ipset) and leveraging that to the max(TM) (verdict maps, concatenated set entries). Assuming the virtual machine design principle isn't just marketing but sets the course for JIT ruleset optimizations, there's some margin as well.

So from my perspective, one should say nftables increased flexibility without sacrificing performance.

> Netfilter's chronic performance differential is why a lot of mindshare
> was lost to userspace networking technologies.
>
> Thankfully, we are gaining back a lot of that userbase with XDP and
> eBPF, thanks to the hard work of many individuals.
>
> To think that people are going to be willing to take the performance
> hit (whatever its size) to go back to the "more flexible" nftables
> is really not a realistic expectation.
>
> And we have amassed enough interest and momentum that offloading eBPF
> in hardware on current and future hardware is happening.
>
> So I am going to direct us in directions that allow those realities to
> be taken advantage of, rather than pretending that this transition
> hasn't occurred already.

Hey, you secretly changed the topic! ;)

Yes, even with my limited experience I noticed that there is quite some demand for even faster packet processing in Linux, mostly for rather custom scenarios like forwarding into containers/VMs.
Though my point was about general purpose firewalling abilities in Linux, say people securing their desktop or maintaining networks with less demands on performance. I guess it will be a while until consumer hardware comes with smart NICs (or they become affordable), so for those people nftables is definitely a step forward.

Cheers, Phil
Re: [PATCH RFC 0/4] net: add bpfilter
On 19 February 2018 at 16:27, David Miller wrote:
> From: Florian Westphal
> Date: Mon, 19 Feb 2018 16:15:55 +0100
>
>> Would you be willing to merge nftables into kernel tools directory
>> then?
>
> Did you miss the part where I explained that people explicitly disable
> NFTABLES in their kernel configs in most if not all large datacenters?

Hey, you already shared several statements regarding nftables which are not true.

Lots and lots of people are using distribution kernels, which have the NF_TABLES config enabled (all major distros have it).

I believe people who build their own kernels are very few compared with the number of people who don't (but yeah, they usually have more money).

This may sound like a joke, but there are *a lot* of people running production servers with bluetooth drivers enabled in the kconfig.

So, I can confirm that:

Lots of people and institutions are using nftables already.
Lots of people and institutions are considering a transition from iptables to nftables.
Lots of people are running simple commodity hardware and know nothing about smartnics or any kind of offloading.
Re: [PATCH RFC 0/4] net: add bpfilter
On 19 February 2018 at 16:36, David Miller wrote:
>
> I think netfilter is at a real crossroads right now.
>

I don't think so. The Netfilter Project and the Netfilter Community already "agreed" on nftables and we are working on it.

But this isn't a secret, right? We have been open-discussing and open-working on this for *years* now.
Re: [PATCH RFC 0/4] net: add bpfilter
Hi David,

On Mon, Feb 19, 2018 at 10:31:39AM -0500, David Miller wrote:
> > Why is it practical to replace your kernel but not practical to replace
> > a small userspace tool running on top of it?
>
> The container is just userspace components. Those are really baked in
> and are never changing.

Never, until you have to apply a bug fix for any of the many components you bake into it. I am doing this on an (at least) weekly basis for my Docker containers. That's no different from a classic Linux distribution where you update your apt/rpm packages all the time.

A container that is static and cannot be continuously updated with new versions for security (and other) fixes is broken by design. If some people are doing this, they IMHO have no sense of IT security, and such usage patterns are not what kernel development should cite as the primary use case (again IMHO).

> This is how cloud hosting environments work.

Yes, *one* particular use case. By far not every use case of Linux, or Linux packet filtering.

--
- Harald Welte           http://laforge.gnumonks.org/
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)
Re: [PATCH RFC 0/4] net: add bpfilter
From: Harald Welte
Date: Mon, 19 Feb 2018 18:20:40 +0100

> It's like with any migration. People were using ipchains for a long
> time even after iptables existed. Many people simply don't care
> about packet filter performance. It's only a small fraction of their
> entire CPU workload, so probably not worth optimizing. For dedicated
> firewall devices, that's of course a different story.

"I have power in my house, what's the big deal about this power outage I hear about?"

People with an Android phone in their pocket are using iptables, and the overhead and performance of those rules really does matter. It determines how long your battery life is, etc.

> I can just as well ask how many millions of users / devices are
> already using eBPF or XDP?

Every time someone connects to a major provider, they are using it.

And by and large, for system tracing and analysis eBPF is basically a hard requirement for people doing anything serious these days.

Please see the wonderful work by Brendan Gregg and others, which has basically made the GPL'ing of DTrace by Oracle entirely irrelevant. Linux's tracing infrastructure has become much more powerful and capable thanks to eBPF.
Re: [PATCH RFC 0/4] net: add bpfilter
From: Phil Sutter
Date: Mon, 19 Feb 2018 18:14:11 +0100

> OK, so reading between the lines you're saying that nftables project
> has failed to provide an adequate successor to iptables?

Whilst it is great that the atomic table update problem was solved, I think the emphasis on flexibility often at the expense of performance was a bad move.

Netfilter's chronic performance differential is why a lot of mindshare was lost to userspace networking technologies.

Thankfully, we are gaining back a lot of that userbase with XDP and eBPF, thanks to the hard work of many individuals.

To think that people are going to be willing to take the performance hit (whatever its size) to go back to the "more flexible" nftables is really not a realistic expectation.

And we have amassed enough interest and momentum that offloading eBPF in hardware on current and future hardware is happening.

So I am going to direct us in directions that allow those realities to be taken advantage of, rather than pretending that this transition hasn't occurred already.

Thank you.
Re: [PATCH RFC 0/4] net: add bpfilter
Hi David,

On Mon, Feb 19, 2018 at 10:36:51AM -0500, David Miller wrote:
> nftables has been purported as "better" for years, yet large
> institutions did not migrate to it. In fact, they explicitly
> disabled NFTABLES in their kernel config.

It's like with any migration. People were using ipchains for a long time even after iptables existed. Many people simply don't care about packet filter performance. It's only a small fraction of their entire CPU workload, so probably not worth optimizing. For dedicated firewall devices, that's of course a different story.

How long did it take for the getrandom() system call to be actually used by applications [even glibc!]? Or many other things that get introduced in the kernel?

I can just as well ask how many millions of users / devices are already using eBPF or XDP? How many major Linux distributions are enabling and/or supporting this yet?

I'm not criticizing, I'm just attempting to illustrate that technologies always take time to establish themselves - and of course those people with the biggest benefit (and knowing about it) will be the early adopters, while many others have no motivation to migrate.

> In my opinion, any resistance to integration with eBPF and XDP will
> lead to even less adoption of netfilter as a technology.

1) I may not have made my point clear, sorry. I have not argued against any integration with eBPF, I have just made some specific arguments against specific aspects of the current RFC.

2) You have indicated repeatedly that there are millions and millions of netfilter/iptables users out there. So I fail to see the "even less adoption" part. "Even less" than those millions and millions? SCNR.

Regards,
	Harald

--
- Harald Welte           http://laforge.gnumonks.org/
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)
Re: [PATCH RFC 0/4] net: add bpfilter
From: Phil Sutter
Date: Mon, 19 Feb 2018 18:09:39 +0100

> What puzzles me about your argumentation is that you seem to propose for
> the kernel to cover up flaws in userspace. Spinning this concept further
> would mean that if there would be an old bug in iproute2 we should think
> of adding a workaround to rtnetlink interface in kernel because
> containers will keep the old iproute2 binary? Or am I (hopefully) just
> missing your point?

I'll answer this with a question.

I tried to remove UFO entirely from the kernel, did you see how that went?
Re: [PATCH RFC 0/4] net: add bpfilter
Hi David,

On Mon, Feb 19, 2018 at 10:44:59AM -0500, David Miller wrote:
> From: Harald Welte
> Date: Mon, 19 Feb 2018 16:38:08 +0100
>
> > On Mon, Feb 19, 2018 at 10:27:27AM -0500, David Miller wrote:
> >> > Would you be willing to merge nftables into kernel tools directory
> >> > then?
> >>
> >> Did you miss the part where I explained that people explicitly disable
> >> NFTABLES in their kernel configs in most if not all large datacenters?
> >
> > If people choose to disable a certain feature, then that is their own
> > decision to do so. We should respect that decision. Clearly they seem
> > to have no interest in a better or more featureful packet filter, then.
> >
> > I mean, it's not like somebody proposes to implement NTFS inside the FAT
> > filesystem kernel module because distributors (or data centers) tend to
> > disable the NTFS module?!
> >
> > How is kernel development these days constrained by what some users may
> > or may not put in their Kconfig? If they want a given feature, they
> > must enable it.
>
> This discussion was about why iptables UABI still matters.
>
> And I'm trying to explain to you one of several reasons why it does.
>
> Also, instead of saying "They decided to not use NFTABLES, oh well
> that is their problem" it might be more beneficial, especially in the
> long term for netfilter, to think about "why".

OK, so reading between the lines you're saying that nftables project has failed to provide an adequate successor to iptables?

Cheers, Phil
Re: [PATCH RFC 0/4] net: add bpfilter
Hi David,

On Mon, Feb 19, 2018 at 10:31:39AM -0500, David Miller wrote:
> From: Harald Welte
> Date: Mon, 19 Feb 2018 16:27:46 +0100
>
> > On Mon, Feb 19, 2018 at 10:13:35AM -0500, David Miller wrote:
> >
> >> Florian, first of all, the whole "change the iptables binary" idea is
> >> a non-starter. For the many reasons I have described in the various
> >> postings I have made today.
> >>
> >> It is entirely impractical.
> >
> > Why is it practical to replace your kernel but not practical to replace
> > a small userspace tool running on top of it?
>
> The container is just userspace components. Those are really baked in
> and are never changing.

Which is a problem per se. Cheap hardware routers are a good example of why business models which tend to get customers stuck with old software have such dramatic effects, at least in matters of security.

> The hosting element, on the other hand, can upgrade the kernel in that
> scenario no problem.
>
> This is how cloud hosting environments work.

What puzzles me about your argumentation is that you seem to propose for the kernel to cover up flaws in userspace. Spinning this concept further would mean that if there were an old bug in iproute2, we should think of adding a workaround to the rtnetlink interface in the kernel because containers will keep the old iproute2 binary? Or am I (hopefully) just missing your point?

Cheers, Phil
Re: [PATCH RFC 0/4] net: add bpfilter
From: Harald Welte Date: Mon, 19 Feb 2018 16:38:08 +0100 > On Mon, Feb 19, 2018 at 10:27:27AM -0500, David Miller wrote: >> > Would you be willing to merge nftables into kernel tools directory >> > then? >> >> Did you miss the part where I explained that people explicitly disable >> NFTABLES in their kernel configs in most if not all large datacenters? > > If people to chose to disable a certain feature, then that is their own > decision to do so. We should respect that decision. Clearly they seem > to have no interest in a better or more featureful packet filter, then. > > I mean, it's not like somebody proposes to implement NTFS inside the FAT > filesystem kernel module because distributors (or data centers) tend to > disable the NTFS module?! > > How is kernel development these days constrained by what some users may > or may not put in their Kconfig? If they want a given feature, they > must enable it. This discussion was about why iptables UABI still matters. And I'm trying to explain to you one of several reasons why it does. Also, instead of saying "They decided to not use NFTABLES, oh well that is their problem" it might be more beneficial, especially in the long term for netfilter, to think about "why".
Re: [PATCH RFC 0/4] net: add bpfilter
From: Jan Engelhardt Date: Mon, 19 Feb 2018 16:37:57 +0100 (CET) > On Monday 2018-02-19 16:32, David Miller wrote: > >>From: Harald Welte >>Date: Mon, 19 Feb 2018 16:23:21 +0100 >> >>> Also, as long as legacy ip_tables/x_tables is still in the kernel, you >>> can still run your old userspace against that old implementation in the >>> kernel. >> >>But without offloading, and the various other benefits which I have >>tried to clearly explain to both you and Florian. > > Which is actually the business model to get people *off* the old ABI in > reasonable time. Hosting companies can't change what customers run in their containers. But if they are told that a kernel upgrade will get them offloading and increase their performance tremendously, then that gives them real value.
Re: [PATCH RFC 0/4] net: add bpfilter
Dear David, On Mon, Feb 19, 2018 at 10:27:27AM -0500, David Miller wrote: > > Would you be willing to merge nftables into kernel tools directory > > then? > > Did you miss the part where I explained that people explicitly disable > NFTABLES in their kernel configs in most if not all large datacenters? If people choose to disable a certain feature, then that is their own decision to do so. We should respect that decision. Clearly they seem to have no interest in a better or more featureful packet filter, then. I mean, it's not like somebody proposes to implement NTFS inside the FAT filesystem kernel module because distributors (or data centers) tend to disable the NTFS module?! How is kernel development these days constrained by what some users may or may not put in their Kconfig? If they want a given feature, they must enable it. -- - Harald Welte http://laforge.gnumonks.org/ "Privacy in residential applications is a desirable marketing option." (ETSI EN 300 175-7 Ch. A6)
Re: [PATCH RFC 0/4] net: add bpfilter
On Monday 2018-02-19 16:32, David Miller wrote: >From: Harald Welte >Date: Mon, 19 Feb 2018 16:23:21 +0100 > >> Also, as long as legacy ip_tables/x_tables is still in the kernel, you >> can still run your old userspace against that old implementation in the >> kernel. > >But without offloading, and the various other benefits which I have >tried to clearly explain to both you and Florian. Which is actually the business model to get people *off* the old ABI in reasonable time. Otherwise, we would have to ask ourselves why we have not yet enhanced /dev/raw with mmap and whatnot.
Re: [PATCH RFC 0/4] net: add bpfilter
From: Harald Welte Date: Mon, 19 Feb 2018 16:23:21 +0100 >> Like it or not iptables ABI based filtering is going to be in the data >> path for many years if not a decade or more to come. > > I beg to differ. For some people, yes. but then, as Florian points > out, they can just as well use the existing x_tables kernel code. If > they want something better, they can either replace their iptables > program with xtables-compat from nftables, or whatever else might > exist for eBPF support. nftables has been purported as "better" for years, yet large institutions did not migrate to it. In fact, they explicitly disabled NFTABLES in their kernel config. You may want to ponder for a little while why that might be. I think netfilter is at a real crossroads right now. In my opinion, any resistance to integration with eBPF and XDP will lead to even less adoption of netfilter as a technology. Therefore my plan is to move everything to be integrated around these important core technologies. For the purposes of integration, code coverage, performance, and the ability to juxtapose different bits of eBPF code into larger optimized code streams that can also be offloaded into hardware.
Re: [PATCH RFC 0/4] net: add bpfilter
From: Harald Welte Date: Mon, 19 Feb 2018 16:23:21 +0100 > Also, as long as legacy ip_tables/x_tables is still in the kernel, you > can still run your old userspace against that old implementation in the > kernel. But without offloading, and the various other benefits which I have tried to clearly explain to both you and Florian.
Re: [PATCH RFC 0/4] net: add bpfilter
From: Harald Welte Date: Mon, 19 Feb 2018 16:27:46 +0100 > On Mon, Feb 19, 2018 at 10:13:35AM -0500, David Miller wrote: > >> Florian, first of all, the whole "change the iptables binary" idea is >> a non-starter. For the many reasons I have described in the various >> postings I have made today. >> >> It is entirely impractical. > > Why is it practical to replace your kernel but not practical to replace > a small userspace tool running on top of it? The container is just userspace components. Those are really baked in and are never changing. The hosting element, on the other hand, can upgrade the kernel in that scenario no problem. This is how cloud hosting environments work.
Re: [PATCH RFC 0/4] net: add bpfilter
Hi David, On Mon, Feb 19, 2018 at 10:13:35AM -0500, David Miller wrote: > Florian, first of all, the whole "change the iptables binary" idea is > a non-starter. For the many reasons I have described in the various > postings I have made today. > > It is entirely impractical. Why is it practical to replace your kernel but not practical to replace a small userspace tool running on top of it? -- - Harald Welte http://laforge.gnumonks.org/ "Privacy in residential applications is a desirable marketing option." (ETSI EN 300 175-7 Ch. A6)
Re: [PATCH RFC 0/4] net: add bpfilter
Hi David, On Mon, Feb 19, 2018 at 09:44:51AM -0500, David Miller wrote: > I see talk about "just replacing the iptables binary". > > A long time ago in a galaxy far far away, that would be a reasonable > scheme. But that kind of approach won't work in the realities of > today. > > You aren't going to be able to replace the iptables binary in the tens > of thousands of container images out there, nor the virtualization > installations either. I appear to have been under the impression that the entire movement to DevOps and automatic provisioning of containers/nodes/pods/VMs with puppet, ansible, Dockerfiles & Co is to be *more* agile in deployments, rather than less. If you cannot even rebuild your thousands of container images with updated binaries, then what is this all worth? You need to be able to update the iptables (or any other) binary in case there's an important (security or otherwise) bug that needs fixing. I don't see this as any different. Also, as long as legacy ip_tables/x_tables is still in the kernel, you can still run your old userspace against that old implementation in the kernel. Nobody forces you to use anything else [for another decade or so]. Just if you want to take advantage of new more modular/performant/... things like nftables or an eBPF backend, then you would have to go that extra mile. I don't think the kernel (network) developers should burden themselves with too many things. There's sufficient on their plate as-is. So * if there's some new system (nftables, bpfilter, ...) * and some documented migration paths for the vast majority of the use cases (replacing iptables binaries with a compat wrapper) * and the old system continues to work as-is (x_tables kernel code stays for several more years) then people who care about the new features or performance will migrate to the new system. And those who don't care stay with the old system - which is not a problem as they clearly wouldn't need the new system anyway. 
> Like it or not iptables ABI based filtering is going to be in the data > path for many years if not a decade or more to come. I beg to differ. For some people, yes. But then, as Florian points out, they can just as well use the existing x_tables kernel code. If they want something better, they can either replace their iptables program with xtables-compat from nftables, or whatever else might exist for eBPF support. > iptables is a victim of its own success, like it or not :-) Yes, the > ABI is terrible, but obviously it was useful enough for lots of > people. And it continues to be. I just don't think it is a great idea to kludge any new packet filter against such an arcane uapi. > Therefore it behooves us to accept this reality and align the data > path generated to match what the rest of the kernel is moving towards > and that is eBPF and XDP. This argument is unrelated to the question of the uapi. I'm not arguing against an eBPF backend/implementation for packet filtering. It's more a question of _how_. > Furthermore, on a long term maintenance perspective, it means that > every data path used by the kernel for iptables will be fully verified > by the eBPF verifier. This means that the iptables data path will be > guaranteed to never get into a loop, access out of bounds data, etc. > > That to me is real power, and something we should pursue. Once again, both not related to the question of the uapi. > I know you can't see how offloading is possible, but I hope > after some further discussion you can see how that might work. I'm looking forward to that point. Regards, Harald -- - Harald Welte http://laforge.gnumonks.org/ "Privacy in residential applications is a desirable marketing option." (ETSI EN 300 175-7 Ch. A6)
Re: [PATCH RFC 0/4] net: add bpfilter
From: Florian Westphal Date: Mon, 19 Feb 2018 16:20:23 +0100 > See my other mail, where I explained, in great detail, the problems > of the xtables UAPI. As the person who wrote the bpfilter UAPI parser for this, you don't need to explain this to me. But it's not going anywhere, and is used by millions upon millions of users.
Re: [PATCH RFC 0/4] net: add bpfilter
From: Florian Westphal Date: Mon, 19 Feb 2018 16:15:55 +0100 > Would you be willing to merge nftables into kernel tools directory > then? Did you miss the part where I explained that people explicitly disable NFTABLES in their kernel configs in most if not all large datacenters?
Re: [PATCH RFC 0/4] net: add bpfilter
David Miller wrote: > From: Florian Westphal > Date: Mon, 19 Feb 2018 15:53:14 +0100 > > > Sure, but looking at all the things that were added to iptables > > to alleviate some of the issues (ipset for instance) shows that we need a > > meaningful re-design of how things work conceptually. > > As you said iptables is in maintenance mode. > > But there are millions upon millions of users, like it or not, and > they aren't going away for decades. And this is the iptables binary > ABI I'm talking about, not the iptables user command line interface. I know. > my house?" Please see further than the view inside your home. > > By and large, we are stuck with iptables's data path for an extremely > long time. So? > Major data centers don't even enable NFTABLES in their kernels, and > there is nothing you can do about that in the short to medium term. So? > Therefore, for all of the beneficial reasons I have discussed we > should make that datapath as aligned and integrated with our core > important technologies as possible, so that they can benefit from any > and all improvements in that area rather than just collecting dust. See my other mail, where I explained, in great detail, the problems of the xtables UAPI. If you go through with this, and, eventually somehow get feature parity, all of the problems remain in full effect. You will also need to replicate the translation efforts that already went into nftables. The translator wasn't yet a high priority as we lacked some features but this can be changed now that nft is catching up. Userspace program expectation is for iptables to be like fib for instance, i.e. you can add and remove without stomping on each other's feet. You are setting this in stone. You're also adding a way to make it so that I can delete entries from the fib (bpfilter) but iproute2 will still show all entries (iptables legacy).
Re: [PATCH RFC 0/4] net: add bpfilter
David Miller wrote: > From: Florian Westphal > Date: Mon, 19 Feb 2018 15:59:35 +0100 > > > David Miller wrote: > >> It also means that the scope of developers who can contribute and work > >> on the translator is much larger. > > > > How so? Translator is in userspace in nftables case too? > > Florian, first of all, the whole "change the iptables binary" idea is > a non-starter. For the many reasons I have described in the various > postings I have made today. > > It is entirely impractical. ??? You suggest: iptables -> setsockopt -> umh (xtables -> ebpf) -> kernel How is this different from iptables -> setsockopt -> umh (xtables -> nftables) -> kernel ? eBPF can be placed within nftables either userspace or kernel, there is nothing that prevents this. > Anything designed in that nature must be distributed completely in the > kernel tree, so that the iptables kernel ABI is provided without any > external dependencies. Would you be willing to merge nftables into kernel tools directory then?
Re: [PATCH RFC 0/4] net: add bpfilter
From: Florian Westphal Date: Mon, 19 Feb 2018 15:59:35 +0100 > David Miller wrote: >> It also means that the scope of developers who can contribute and work >> on the translator is much larger. > > How so? Translator is in userspace in nftables case too? Florian, first of all, the whole "change the iptables binary" idea is a non-starter. For the many reasons I have described in the various postings I have made today. It is entirely impractical. So we are strictly talking about the code we are writing to translate iptables ABI (in the kernel) into an eBPF based datapath. Anything designed in that nature must be distributed completely in the kernel tree, so that the iptables kernel ABI is provided without any external dependencies. We could have done the translator in the kernel, but instead we are doing it with a userland component. And that's what we are talking about. Thank you.
Re: [PATCH RFC 0/4] net: add bpfilter
From: Florian Westphal Date: Mon, 19 Feb 2018 15:53:14 +0100 > Sure, but looking at all the things that were added to iptables > to alleviate some of the issues (ipset for instance) shows that we need a > meaningful re-design of how things work conceptually. As you said iptables is in maintenance mode. But there are millions upon millions of users, like it or not, and they aren't going away for decades. And this is the iptables binary ABI I'm talking about, not the iptables user command line interface. These discussions about nftables migrations sound like a person near a power outage who exclaims: "What's the big deal, the lights are on in my house?" Please see further than the view inside your home. By and large, we are stuck with iptables's data path for an extremely long time. Major data centers don't even enable NFTABLES in their kernels, and there is nothing you can do about that in the short to medium term. Therefore, for all of the beneficial reasons I have discussed we should make that datapath as aligned and integrated with our core important technologies as possible, so that they can benefit from any and all improvements in that area rather than just collecting dust. Thank you.
Re: [PATCH RFC 0/4] net: add bpfilter
David Miller wrote: > From: Daniel Borkmann > Date: Mon, 19 Feb 2018 13:03:17 +0100 > > > Thought was that it would be more suitable to push all the complexity of > > such translation into user space which brings couple of additional > > advantages > > as well: the translation can become very complex and thus it would contain > > all of it behind syscall boundary where natural path of loading programs > > would go via verifier. Given the tool would reside in user space, it would > > also allow to ease development and testing can happen w/o recompiling the > > kernel. It would allow for all the clang sanitizers to run there and for > > having a comprehensive test suite to verify and dry test translations > > against > > traffic test patterns (e.g. bpf infra would provide possibilities on this > > w/o complex setup). Given normal user mode helpers make this rather painful > > since they need to be shipped as extra package by the various distros, the > > idea was that the module loader back end could treat umh similarly as kernel > > modules and hook them in through request_module() approach while still > > operating out of user space. In any case, I could image this approach might > > be interesting and useful in general also for other subsystems requiring > > umh in one way or another. > > Yes, this is a very powerful new facility. > > It also means that the scope of developers who can contribute and work > on the translater is much larger. How so? Translator is in userspace in nftables case too?
Re: [PATCH RFC 0/4] net: add bpfilter
From: Daniel Borkmann Date: Mon, 19 Feb 2018 13:03:17 +0100 > Thought was that it would be more suitable to push all the complexity of > such translation into user space which brings a couple of additional advantages > as well: the translation can become very complex and thus it would contain > all of it behind syscall boundary where natural path of loading programs > would go via verifier. Given the tool would reside in user space, it would > also allow to ease development and testing can happen w/o recompiling the > kernel. It would allow for all the clang sanitizers to run there and for > having a comprehensive test suite to verify and dry test translations against > traffic test patterns (e.g. bpf infra would provide possibilities on this > w/o complex setup). Given normal user mode helpers make this rather painful > since they need to be shipped as extra package by the various distros, the > idea was that the module loader back end could treat umh similarly as kernel > modules and hook them in through request_module() approach while still > operating out of user space. In any case, I could imagine this approach might > be interesting and useful in general also for other subsystems requiring > umh in one way or another. Yes, this is a very powerful new facility. It also means that the scope of developers who can contribute and work on the translator is much larger. When we showed this infrastructure to Linus he thought it was a very sane idea.
Re: [PATCH RFC 0/4] net: add bpfilter
David Miller wrote: > > How many of those wide-spread applications are you aware of? The two > > projects you have pointed out (docker and kubernetes) don't. As the > > assumption that many such tools would need to be supported drives a lot > > of the design decisions, I would argue one needs a solid empirical basis. > > I see talk about "just replacing the iptables binary". > > A long time ago in a galaxy far far away, that would be a reasonable > scheme. But that kind of approach won't work in the realities of > today. > > You aren't going to be able to replace the iptables binary in the tens > of thousands of container images out there, nor the virtualization > installations either. Why would you have to? iptables kernel parts are still maintained, it's not dead code that stands in the way. We can leave it alone, in maintenance mode, just fine. > Like it or not iptables ABI based filtering is going to be in the data > path for many years if not a decade or more to come. iptables is a > victim of its own success, like it or not :-) Yes, the ABI is > terrible, but obviously it was useful enough for lots of people. Sure, but looking at all the things that were added to iptables to alleviate some of the issues (ipset for instance) shows that we need a meaningful re-design of how things work conceptually. The umh helper translation that has been proposed could be applied to transparently xlate iptables to nftables (or e.g. iptables compat32 to iptables64), i.e. legacy binary talks to kernel, kernel invokes umh, umh generates nftables netlink messages. But I don't even see a need to do this, I don't think it's an issue to leave it in the tree even for another decade or more if need be.
Re: [PATCH RFC 0/4] net: add bpfilter
From: Harald Welte Date: Mon, 19 Feb 2018 13:52:18 +0100 >> Right, having a custom iptables, libiptc or LD_PRELOAD approach would work >> as well of course, but it still wouldn't address applications that have >> their own custom libs programmed against iptables uapi directly or those >> that reused a builtin or modified libiptc directly in their application. > > How many of those wide-spread applications are you aware of? The two > projects you have pointed out (docker and kubernetes) don't. As the > assumption that many such tools would need to be supported drives a lot > of the design decisions, I would argue one needs a solid empirical basis. I see talk about "just replacing the iptables binary". A long time ago in a galaxy far far away, that would be a reasonable scheme. But that kind of approach won't work in the realities of today. You aren't going to be able to replace the iptables binary in the tens of thousands of container images out there, nor the virtualization installations either. Like it or not iptables ABI based filtering is going to be in the data path for many years if not a decade or more to come. iptables is a victim of its own success, like it or not :-) Yes, the ABI is terrible, but obviously it was useful enough for lots of people. Therefore it behooves us to accept this reality and align the data path generated to match what the rest of the kernel is moving towards and that is eBPF and XDP. Furthermore, from a long term maintenance perspective, it means that every data path used by the kernel for iptables will be fully verified by the eBPF verifier. This means that the iptables data path will be guaranteed to never get into a loop, access out of bounds data, etc. That to me is real power, and something we should pursue. This doesn't even get into the offloading and other benefits that are possible. I know you can't see how offloading is possible, but I hope after some further discussion you can see how that might work. Thanks.
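The two guarantees named above (no loops, no out-of-bounds access) can be illustrated with a toy model. The sketch below is not the eBPF verifier and does not use its instruction set: the tuple-based instructions, the `toy_verify` name and the fixed `min_pkt_len` bound are all assumptions made for illustration. It only shows the general idea of rejecting a program before it ever runs, by refusing backward jumps and packet loads beyond a known length.

```python
from typing import List, Tuple

# Hypothetical toy instructions (NOT real eBPF opcodes):
#   ("jmp", rel)        jump to pc + 1 + rel
#   ("load", off, size) read `size` bytes at packet offset `off`
#   ("exit",)           end of program


def toy_verify(prog: List[tuple], min_pkt_len: int) -> Tuple[bool, str]:
    """Statically check a toy program; return (accepted, reason)."""
    if not prog or prog[-1][0] != "exit":
        return False, "program must end with exit"
    for pc, insn in enumerate(prog):
        op = insn[0]
        if op == "jmp":
            target = pc + 1 + insn[1]
            if target <= pc:
                # A jump to itself or to an earlier instruction could loop.
                return False, f"back-edge at insn {pc}"
            if target >= len(prog):
                return False, f"jump out of program at insn {pc}"
        elif op == "load":
            _, off, size = insn
            if off < 0 or off + size > min_pkt_len:
                return False, f"out-of-bounds load at insn {pc}"
        elif op != "exit":
            return False, f"unknown opcode at insn {pc}"
    return True, "ok"


if __name__ == "__main__":
    good = [("load", 12, 2), ("jmp", 0), ("exit",)]
    looping = [("load", 0, 1), ("jmp", -2), ("exit",)]
    print(toy_verify(good, 14))
    print(toy_verify(looping, 14))
```

Because every jump goes strictly forward, any accepted program terminates after at most `len(prog)` steps, which is the property the argument above relies on.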
Re: [PATCH RFC 0/4] net: add bpfilter
Hi Daniel, On Mon, Feb 19, 2018 at 01:03:17PM +0100, Daniel Borkmann wrote: > Hi Harald, > > On 02/17/2018 01:11 PM, Harald Welte wrote: > [...] > >> As rule translation can potentially become very complex, this is performed > >> entirely in user space. In order to ease deployment, request_module() code > >> is extended to allow user mode helpers to be invoked. Idea is that user > >> mode > >> helpers are built as part of the kernel build and installed as traditional > >> kernel modules with .ko file extension into distro specified location, > >> such that from a distribution point of view, they are no different than > >> regular kernel modules. > > > > That just blew my mind, sorry :) This goes much beyond > > netfilter/iptables, and adds some quite significant new piece of > > kernel/userspace infrastructure. To me, my apologies, it just sounds > > like a quite strange hack. But then, I may lack the vision of how this > > might be useful in other contexts. > > Thought was that it would be more suitable to push all the complexity of > such translation into user space [...] Sure, you have no complaints from my side about that goal. I'm just not sure if turning the kernel module loader into a new mechanism to start userspace processes is the right way to achieve it. I guess that's a question that the people involved with core kernel code and module loader have to answer. To me it seems like a very long detour away from the actual topic (packet filtering). > Given normal user mode helpers make this rather painful since they > need to be shipped as extra package by the various distros, the idea > was that the module loader back end could treat umh similarly as > kernel modules and hook them in through request_module() approach > while still operating out of user space. In any case, I could imagine > this approach might be interesting and useful in general also for > other subsystems requiring umh in one way or another. I completely agree this approach has some logic to it. 
I just think the approach taken is *very* different from what has been traditionally done in the Linux world. All sorts of userspace programs to configure kernel features (iptables being one of them, along with iproute2, etc.) have always been distributed as separate/independent application programs, which are packaged separately, etc. Making the kernel source tree build such userspace utilities and executing them in a new fashion via the kernel module loader are to me two quite large conceptual changes on how "Linux works", and I believe you will have to "sell" this idea to many people outside the kernel networking community, i.e. core kernel developers, people who do packaging, etc. I'm not saying I'm fundamentally opposed to it. Will be curious to see how the wider kernel community thinks of that architecture. > Right, having a custom iptables, libiptc or LD_PRELOAD approach would work > as well of course, but it still wouldn't address applications that have > their own custom libs programmed against iptables uapi directly or those > that reused a builtin or modified libiptc directly in their application. How many of those wide-spread applications are you aware of? The two projects you have pointed out (docker and kubernetes) don't. As the assumption that many such tools would need to be supported drives a lot of the design decisions, I would argue one needs a solid empirical basis. Also, the LD_PRELOAD wrapper *would* work with all those programs. Only the iptables command line replacement wouldn't catch those. > Such requests could only be covered transparently by having a small shim > layer in kernel and it also wouldn't require any extra packages from distro > side. What is wrong with extra packages in distributions? Distributions also will have to update the kernel to include your new code, so they could at the same time use a new iptables (or $whatever) package. This is true for virtually all new kernel features. 
Your userland needs to go along with it, if it wants to use those new features. > > Some of those can be implemented easily in BPF (like recomputing the > > checksum or the like). Some others I would find much more difficult - > > particularly if you want to off-load it to the NIC. They require access > > to state that only the kernel has (like 'cgroup' or 'owner' matching). > > Yeah, when it comes to offloading, the latter two examples are heavily tied > to upper layers of the (local) stack, so for cases like those it wouldn't > make much sense, but e.g. matches, mangling or forwarding based on packet > data are obvious candidates that can already be offloaded today in a > flexible and programmable manner all with existing BPF infra, so for those > it could definitely be highly interesting to make use of it. While I believe you there are many ways how one can offload things flexibly with eBPF, I still have a hard time understanding how you want to merge this with the existing well-defined notion of when exactly a given chain
Re: [PATCH RFC 0/4] net: add bpfilter
Hi Harald, On 02/17/2018 01:11 PM, Harald Welte wrote: [...] >> As rule translation can potentially become very complex, this is performed >> entirely in user space. In order to ease deployment, request_module() code >> is extended to allow user mode helpers to be invoked. Idea is that user mode >> helpers are built as part of the kernel build and installed as traditional >> kernel modules with .ko file extension into distro specified location, >> such that from a distribution point of view, they are no different than >> regular kernel modules. > > That just blew my mind, sorry :) This goes much beyond > netfilter/iptables, and adds some quite significant new piece of > kernel/userspace infrastructure. To me, my apologies, it just sounds > like a quite strange hack. But then, I may lack the vision of how this > might be useful in other contexts. Thought was that it would be more suitable to push all the complexity of such translation into user space which brings a couple of additional advantages as well: the translation can become very complex and thus it would contain all of it behind syscall boundary where natural path of loading programs would go via verifier. Given the tool would reside in user space, it would also allow to ease development and testing can happen w/o recompiling the kernel. It would allow for all the clang sanitizers to run there and for having a comprehensive test suite to verify and dry test translations against traffic test patterns (e.g. bpf infra would provide possibilities on this w/o complex setup). Given normal user mode helpers make this rather painful since they need to be shipped as extra package by the various distros, the idea was that the module loader back end could treat umh similarly as kernel modules and hook them in through request_module() approach while still operating out of user space. 
In any case, I could imagine this approach might be interesting and useful in general also for other subsystems requiring umh in one way or another. > I'm trying to understand why exactly one would > * use a 18 year old iptables userspace program with its equally old > setsockopt based interface between kernel and userspace > * insert an entire table with many chains of rules into the kernel > * re-eject that ruleset into another userspace program which then > compiles it into an eBPF program > * insert that back into the kernel > > To me, this looks like some kind of legacy backwards compatibility > mechanism that one would find in proprietary operating systems, but not > in Linux. iptables, libiptc etc. are all free software. The source > code can be edited, and you could just as well have a new version of > iptables and/or libiptc which would pass the ruleset in userspace to > your compiler, which would then insert the resulting eBPF program. > > You could even have a LD_PRELOAD wrapper doing the same. That one > would even work with direct users of the iptables setsockopt interface. > > Why add quite comprehensive kernel infrastructure? What's the motivation > here? Right, having a custom iptables, libiptc or LD_PRELOAD approach would work as well of course, but it still wouldn't address applications that have their own custom libs programmed against iptables uapi directly or those that reused a builtin or modified libiptc directly in their application. Such requests could only be covered transparently by having a small shim layer in kernel and it also wouldn't require any extra packages from distro side. [...] >> In the implemented proof of concept we show that simple /32 src/dst IPs >> are translated in such manner. > > Of course this is the first that one starts with. However, as we all > know, iptables was never very good or efficient about 5-tuple matching. 
> If you want a fast implementation of this, you don't use iptables which > does linear list iteration. The reason/rationale/use-case of iptables > is its many (I believe more than 100 now?) extensions both on the area > of matches and targets. > > Some of those can be implemented easily in BPF (like recomputing the > checksum or the like). Some others I would find much more difficult - > particularly if you want to off-load it to the NIC. They require access > to state that only the kernel has (like 'cgroup' or 'owner' matching). Yeah, when it comes to offloading, the latter two examples are heavily tied to upper layers of the (local) stack, so for cases like those it wouldn't make much sense, but e.g. matches, mangling or forwarding based on packet data are obvious candidates that can already be offloaded today in a flexible and programmable manner all with existing BPF infra, so for those it could definitely be highly interesting to make use of it. >> In the below example, we show that dumping, loading and offloading of >> one or multiple simple rules work, we show the bpftool XDP dump of the >> generated BPF instruction sequence as well as a simple functional ping >> test to enforce poli
Re: [PATCH RFC 0/4] net: add bpfilter
Daniel Borkmann wrote:
> As rule translation can potentially become very complex, this is performed
> entirely in user space. In order to ease deployment, request_module() code
> is extended to allow user mode helpers to be invoked. Idea is that user mode
> helpers are built as part of the kernel build and installed as traditional
> kernel modules with .ko file extension into distro specified location,
> such that from a distribution point of view, they are no different than
> regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) binaries via:
>
> request_module("foo") ->
> call_umh("modprobe foo") ->
> sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)
>
> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root)

Unrelated: AFAIU this would allow to e.g. move the compat32 handlers (which are very ugly/error prone) off to userspace?

compat_syscall -> umh_32_64_xlate -> syscall() ?

[ feel free to move this to different thread, only mentioning this so I won't forget ]
Re: [PATCH RFC 0/4] net: add bpfilter
Harald Welte wrote:
> I believe _if_ one wants to use the approach of "hiding" eBPF behind
> iptables, then either
[..]
> b) you must introduce new 'tables', like an 'xdp' table which then has
>    the notion of processing very early in processing, way before the
>    normal filter table INPUT processing happens.

In nftables, the netdev ingress hook location could be used for this, but right, iptables has no equivalent.

netdev ingress is interesting from a hw-offload point of view: unlike all other netfilter hooks, it's tied to a specific network interface rather than owned by the network namespace.

A rule like (yes, I am making this up)

limit 1 byte/s

cannot be offloaded because it affects all packets going through the system, i.e. you'd need to share state among all NICs, which I think won't work :-)

Same goes for any other match/target that somehow contains (global) state and was added to the 'classic' iptables hook points (exception: rule restricts interface via '-i foo').

Note well: "offloaded != ebpf" in this case.

I see no reason why ebpf cannot be used in either iptables or nftables. How to get there is obviously a different beast.

For iptables, I think we should put it in maintenance mode and focus on nftables, for many reasons outlined in other replies.

And how to best make use of ebpf+nftables.

In an ideal world, nftables would have used (e)bpf from the start. But, well, it's not an ideal world (iirc nft origins are just a bit too old).

That doesn't mean that we can't leverage ebpf from nftables. It's just a question of where it makes sense and where it doesn't, f.e. I see no reason to replace C code with ebpf just 'because you can'.

Speedup? Good argument.
Feature enhancements that could use ebpf programs? Another good argument.

I guess there are a lot more.

So I'd like to second Harald's question. What is the main goal?

For nftables, I believe the most important ones are:
- make kernel keeper/owner of all rules
- allow userspace to learn of rule addition/deletion
- provide fast matching (no linear evaluation of rules, native sets with jump and verdict maps)
- provide a single tool instead of ip/ip6/arp/ebtables
- unified ipv4/ipv6 matching
- backwards compat and/or translation infrastructure

But once these are reached, we will hopefully have more:
- offloading (hardware)
- speedup via JIT compilation
- feature enhancements such as matching arbitrary packet contents

I suspect you see that ebpf might be a fit and/or help us with all of these things.

So, once I understand what your goals are I might be better able to see how nftables could fit into the picture; as you can see I did a lot of guesswork :-)
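
To make the "fast matching (no linear evaluation of rules ...)" point concrete, here is a minimal sketch contrasting a linear rule walk with a verdict-map lookup. It is plain Python, not nftables code; the ruleset, `linear_verdict` and `map_verdict` are hypothetical names invented for this illustration.

```python
import ipaddress

# Hypothetical toy ruleset: (source address, verdict) pairs.
RULES = [(ipaddress.ip_address(f"192.168.0.{i}"), "accept") for i in range(1, 101)]


def linear_verdict(src, rules, default="drop"):
    """Walk the rules in order, the way a linear rule list is evaluated."""
    for addr, verdict in rules:
        if addr == src:
            return verdict
    return default


# Verdict map: a single hash lookup replaces the walk over all rules.
VERDICT_MAP = {addr: verdict for addr, verdict in RULES}


def map_verdict(src, vmap, default="drop"):
    """Look the source address up directly instead of iterating."""
    return vmap.get(src, default)
```

Both functions return the same verdicts; the difference is that the linear walk costs time proportional to the number of rules, while the map lookup does not grow with the ruleset in the common case.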
Re: [PATCH RFC 0/4] net: add bpfilter
Florian Westphal wrote:
> David Miller wrote:
> > From: Florian Westphal
> > Date: Fri, 16 Feb 2018 17:14:08 +0100
> >
> > > Any particular reason why translating iptables rather than nftables
> > > (it should be possible to monitor the nftables changes that are
> > > announced by kernel and act on those)?
> >
> > As Daniel said, iptables is by far the most deployed of the two
> > technologies. Therefore it provides the largest environment for
> > testing and coverage.
>
> Right, but the approach of hooking old blob format comes with
> lots of limitations that were meant to be resolved with a netlink based
> interface which places kernel in a position to mediate all transactions
> to the rule database (which isn't fixable with old setsockopt format).
>
> As all programs call iptables(-restore) or variants translation can
> be done in userspace to nftables so api spoken is nfnetlink.
> Such a translator already exists and can handle some cases already:
>
> nft flush ruleset
> nft list ruleset | wc -l
> 0
> xtables-compat-multi iptables -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
> xtables-compat-multi iptables -A REJECT_LOG -i eth0 -p tcp --tcp-flags SYN,ACK SYN --dport 22:80 -m limit --limit 1/sec -j LOG --log-prefix "RejectTCPConnectReq"

To be fair, for these two I had to use

$(xtables-compat-multi iptables-translate -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT)

Reason is that the 'iptables-translate' part nowadays has way more translations available (nft gained many features since the iptables-compat layer was added).

If given appropriate priority however it should be pretty trivial to make the 'translate' descriptions available in the 'direct' version. We already have a function in libnftables to execute/run a command directly from a buffer, so this would not even need fork/execve overhead (although I don't think it's a big concern).

> (f.e. nftables misses some selinux matches/targets for netlabel so we
> obviously can't translate this, same for ipsec sa/policy matching -- but
> this isn't impossible to resolve).

I am working on some poc code for the sa/policy thing now.
Re: [PATCH RFC 0/4] net: add bpfilter
David Miller wrote:
> From: Florian Westphal
> Date: Fri, 16 Feb 2018 17:14:08 +0100
>
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> > announced by kernel and act on those)?
>
> As Daniel said, iptables is by far the most deployed of the two
> technologies. Therefore it provides the largest environment for
> testing and coverage.

Right, but the approach of hooking old blob format comes with lots of limitations that were meant to be resolved with a netlink based interface which places kernel in a position to mediate all transactions to the rule database (which isn't fixable with old setsockopt format).

As all programs call iptables(-restore) or variants translation can be done in userspace to nftables so api spoken is nfnetlink. Such a translator already exists and can handle some cases already:

nft flush ruleset
nft list ruleset | wc -l
0
xtables-compat-multi iptables -A INPUT -s 192.168.0.24 -j ACCEPT
xtables-compat-multi iptables -A INPUT -s 192.168.0.0/16 -p tcp --dport 22 -j ACCEPT
xtables-compat-multi iptables -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
xtables-compat-multi iptables -A INPUT -p icmp -j ACCEPT
xtables-compat-multi iptables -N REJECT_LOG
xtables-compat-multi iptables -A REJECT_LOG -i eth0 -p tcp --tcp-flags SYN,ACK SYN --dport 22:80 -m limit --limit 1/sec -j LOG --log-prefix "RejectTCPConnectReq"
xtables-compat-multi iptables -A REJECT_LOG -j DROP
xtables-compat-multi iptables -A INPUT -j REJECT_LOG

nft list ruleset
table ip filter {
	chain INPUT {
		type filter hook input priority 0; policy accept;
		ip saddr 192.168.0.24 counter packets 0 bytes 0 accept
		ip saddr 192.168.0.0/16 tcp dport 22 counter accept
		iifname "eth0" ct state related,established counter accept
		ip protocol icmp counter packets 0 bytes 0 accept
		counter packets 0 bytes 0 jump REJECT_LOG
	}

	chain FORWARD {
		type filter hook forward priority 0; policy accept;
	}

	chain OUTPUT {
		type filter hook output priority 0; policy accept;
	}

	chain REJECT_LOG {
		iifname "eth0" tcp dport 22-80 tcp flags & (syn | ack) == syn limit rate 1/second burst 5 packets counter packets 0 bytes 0 log prefix "RejectTCPConnectReq"
		counter packets 0 bytes 0 drop
	}
}

and, while 'iptables' rules were added, nft monitor in different terminal:

nft monitor
add table ip filter
add chain ip filter INPUT { type filter hook input priority 0; policy accept; }
add chain ip filter FORWARD { type filter hook forward priority 0; policy accept; }
add chain ip filter OUTPUT { type filter hook output priority 0; policy accept; }
add rule ip filter INPUT ip saddr 192.168.0.24 counter packets 0 bytes 0 accept
# new generation 9893 by process 7471 (xtables-compat-)
add rule ip filter INPUT ip saddr 192.168.0.0/16 tcp dport 22 counter accept
# new generation 9894 by process 7504 (xtables-compat-)
add rule ip filter INPUT iifname "eth0" ct state related,established counter accept
# new generation 9895 by process 7528 (xtables-compat-)
add rule ip filter INPUT ip protocol icmp counter packets 0 bytes 0 accept
# new generation 9896 by process 7542 (xtables-compat-)
add chain ip filter REJECT_LOG
# new generation 9897 by process 7595 (xtables-compat-)
add rule ip filter REJECT_LOG iifname "eth0" tcp dport 22-80 tcp flags & (syn | ack) == syn limit rate 1/second burst 5 packets counter packets 0 bytes 0 log prefix "RejectTCPConnectReq"
# new generation 9898 by process 7639 (xtables-compat-)
add rule ip filter REJECT_LOG counter packets 0 bytes 0 drop
# new generation 9899 by process 7657 (xtables-compat-)
add rule ip filter INPUT counter packets 0 bytes 0 jump REJECT_LOG
# new generation 9900 by process 7663 (xtables-compat-)

Now, does this work in all cases?

Unfortunately not -- this is still work-in-progress, so I would not rm /sbin/iptables and replace it with a link to xtables-compat-multi just yet.

(f.e. nftables misses some selinux matches/targets for netlabel so we obviously can't translate this, same for ipsec sa/policy matching -- but this isn't impossible to resolve).

Hopefully this does show that at least some commonly used features work and that we've come a long way to make seamless nftables transition happen.
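
As a rough illustration of what translating rule syntax in userspace involves, here is a toy translator sketch in Python. It is not the xtables-compat or iptables-translate code; it handles only a hypothetical subset (-A, -s, -p, --dport, -j ACCEPT/DROP) and raises an error for everything else, which mirrors the "cannot translate this yet" case mentioned above. The function name `translate` and its option table are made up for this example.

```python
import shlex

# Hypothetical subset of targets this toy understands.
VERDICTS = {"ACCEPT": "accept", "DROP": "drop"}
KNOWN_OPTIONS = ("-A", "-s", "-p", "--dport", "-j")


def translate(rule):
    """Translate a tiny subset of iptables rule syntax into an nft rule body."""
    args = shlex.split(rule)
    opts = {}
    i = 0
    while i < len(args):
        if args[i] not in KNOWN_OPTIONS:
            # Anything outside the subset is reported, not silently dropped.
            raise ValueError(f"cannot translate option: {args[i]}")
        if i + 1 >= len(args):
            raise ValueError(f"missing value for option: {args[i]}")
        opts[args[i]] = args[i + 1]
        i += 2

    parts = []
    if "-s" in opts:
        parts.append(f"ip saddr {opts['-s']}")
    proto = opts.get("-p")
    if "--dport" in opts:
        if proto not in ("tcp", "udp"):
            raise ValueError("--dport needs -p tcp or -p udp")
        parts.append(f"{proto} dport {opts['--dport']}")
    elif proto:
        parts.append(f"ip protocol {proto}")
    parts.append("counter")
    if opts.get("-j") not in VERDICTS:
        raise ValueError(f"cannot translate target: {opts.get('-j')}")
    parts.append(VERDICTS[opts["-j"]])
    return " ".join(parts)
```

For example, `translate("-A INPUT -s 192.168.0.0/16 -p tcp --dport 22 -j ACCEPT")` yields `ip saddr 192.168.0.0/16 tcp dport 22 counter accept`, while a rule using `-m conntrack` raises `ValueError` because the toy has no translation for it.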
Re: [PATCH RFC 0/4] net: add bpfilter
Daniel Borkmann wrote:
> Hi Florian,
>
> On 02/16/2018 05:14 PM, Florian Westphal wrote:
> > Florian Westphal wrote:
> >> Daniel Borkmann wrote:
> >> Several questions spinning at the moment, I will probably come up with
> >> more:
> >
> > ... and here there are some more ...
> >
> > One of the many pain points of xtables design is the assumption of 'used
> > only by sysadmin'.
> >
> > This has not been true for a very long time, so by now iptables has
> > this userspace lock (yes, its fugly workaround) to serialize concurrent
> > iptables invocations in userspace.
> >
> > AFAIU the translate-in-userspace design now brings back the old problem
> > of different tools overwriting each others iptables rules.
>
> Right, so the behavior would need to be adapted to be exactly the same,
> given all the requests go into kernel space first via the usual uapis,
> I don't think there would be anything in the way of keeping that as is.

Uff. This isn't solvable. At least that's what I tried to say here. This is a limitation of the xtables setsockopt interface design.

If $docker (or anything else) adds a new rule using plain iptables, other daemons are not aware of it.

If someone deletes a rule added by $software, it won't learn that either.

The "solutions" in place now (periodic reloads, 'is my rule still in place' checks etc.) are not desirable long-term.

You'll also need 4 decoders for arp/ip/ip6/ebtables plus translations for all matches and targets xtables currently has (almost 100, I would guess from a quick glance).

Some of the more crazy ones also have external user visible interfaces outside setsockopt (proc files, ipset).

> > One of the nftables advantages is that (since rule representation in
> > kernel is black-box from userspace point of view) is that the kernel
> > can announce add/delete of rules or elements from nftables sets.
> >
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> > announced by kernel and act on those)?
>
> Yeah, correct, this should be possible as well. We started out with the
> iptables part in the demo as the majority of bigger infrastructure projects
> all still rely heavily on it (e.g. docker, k8s to just name two big ones).

Yes, which is why we have translation tools in place.

Just for the fun of it I tried to delete the ip/ip6tables binaries on my fedora27 laptop and replaced them with symlinks to 'xtables-compat-multi'.

Aside from two issues (SELinux denying 'iptables' the use of netlink, and one translation issue with -m rpfilter, which can be translated in the current upstream version), this works out of the box: the translator uses the nftables api to the kernel (so the kernel doesn't even know which program is talking...), 'nft monitor' displays the rules being added, and 'nft list ruleset' shows the default firewalld ruleset.

Obviously there are a few limitations, for instance ip6tables-save will stop working once you add nft-based rules that use features that cannot be expressed in xtables syntax, for instance verdict maps, sets and the like (it will throw an error message similar to 'you are using nftables features not available in xtables, please use nft').

> Usually they have their requests to iptables baked into their code directly
> which probably won't change any time soon, so thought was that they could
> benefit initially from it once there would be sufficient coverage.

See above, the translator covers most basic use cases nowadays.

The more extreme cases are not covered because we were reluctant to provide an equivalent in nftables (-m time comes to mind, which was always a PITA because the kernel has no notion of timezone or DST transitions, leading to 'magic' mismatches when the timezone changes...).

I could expand on more problem cases but none of them are too important I think.

If you'd like to have more ebpf users in the kernel, then there is at least one use case where ebpf could be very attractive for nftables (matching dynamic headers and the like). This would be a new feature and would need changes on the nftables userspace side as well (we don't have syntax/grammar to represent this in either nft or iptables).

In its most basic form, it would be an nftables replacement for '-m string' (and perhaps also -m bpf to some degree, depends on how it would be realized).

We can discuss more if there is interest, but I think it would be more suitable for conference/face to face discussion.
Re: [PATCH RFC 0/4] net: add bpfilter
Hi Daniel,

On Fri, Feb 16, 2018 at 02:40:19PM +0100, Daniel Borkmann wrote:
> This is a very rough and early proof of concept that implements bpfilter.
> The basic idea of bpfilter is that it can process iptables queries and
> translate them in user space into BPF programs which can then get attached
> at various locations.

Interesting approach. My first question would be what the goal of all of this is. For sure, one can implement many different things, but what is the use case, and why do it this way?

I see several possible areas of contention:

1) If you aim for a non-feature-complete support of iptables rules, it will create confusion for users. When users use "iptables", they have assumptions on what it will do and how it will behave. One can of course replace / refactor the internal implementation, if the resulting behavior is identical. And that means rules are executed at the same hooks in the stack, with functionally identical matches and targets, provide the same counter semantics, etc. But if the behavior is different, and/or the provided functionality is different, then why "hide" this new filtering technology behind iptables, rather than its own command line tool? Such an alternative tool could share the same command line syntax as iptables, or even provide a converter/wrapper, but given that it would not be called "iptables" people will implicitly have different assumptions about it.

2) Why try to provide compatibility to iptables, when at the same time many people have already migrated (or are in the process of migrating) to nftables? By using iptables semantics, structures, architecture, you risk perpetuating the design mistakes we made in iptables some 18 years ago for another decade or more. From my POV, if one was to do eBPF optimized rule execution, it should be based on nftables rather than iptables. This way you avoid the many architectural problems, such as
 * no incremental rule changes but only atomic swap of an entire table with all its chains
 * no common/shared rulesets for IPv4 + IPv6, which is very clumsy and often worked around with ugly shellscript wrappers in userspace which then call both iptables and ip6tables to add a rule to both rulesets.

> The user space iptables binary issuing rule addition or dumps was
> left as-is, thus at some point any binaries against iptables uapi kernel
> interface could transparently be supported in such manner in long term.

See my comments above: In the netfilter community, we have known for at least a decade about the many problems of the old iptables userspace interface. For many years, a much better replacement has been designed as part of nftables.

> As rule translation can potentially become very complex, this is performed
> entirely in user space. In order to ease deployment, request_module() code
> is extended to allow user mode helpers to be invoked. Idea is that user mode
> helpers are built as part of the kernel build and installed as traditional
> kernel modules with .ko file extension into distro specified location,
> such that from a distribution point of view, they are no different than
> regular kernel modules.

That just blew my mind, sorry :) This goes much beyond netfilter/iptables, and adds some quite significant new piece of kernel/userspace infrastructure. To me, my apologies, it just sounds like a quite strange hack. But then, I may lack the vision of how this might be useful in other contexts.

I'm trying to understand why exactly one would
 * use an 18 year old iptables userspace program with its equally old setsockopt based interface between kernel and userspace
 * insert an entire table with many chains of rules into the kernel
 * re-eject that ruleset into another userspace program which then compiles it into an eBPF program
 * insert that back into the kernel

To me, this looks like some kind of legacy backwards compatibility mechanism that one would find in proprietary operating systems, but not in Linux. iptables, libiptc etc. are all free software. The source code can be edited, and you could just as well have a new version of iptables and/or libiptc which would pass the ruleset in userspace to your compiler, which would then insert the resulting eBPF program.

You could even have a LD_PRELOAD wrapper doing the same. That one would even work with direct users of the iptables setsockopt interface.

Why add quite comprehensive kernel infrastructure? What's the motivation here?

> Thus, allow request_module() logic to load such
> user mode helper (umh) binaries via:
>
> request_module("foo") ->
> call_umh("modprobe foo") ->
> sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)
>
> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack sur
Re: [PATCH RFC 0/4] net: add bpfilter
Hi Daniel,

On Fri, Feb 16, 2018 at 09:44:01PM +0100, Daniel Borkmann wrote:
> We started out with the
> iptables part in the demo as the majority of bigger infrastructure projects
> all still rely heavily on it (e.g. docker, k8s to just name two big ones).

docker is exec'ing the iptables command line program. So one could simply offer a syntactically compatible userspace replacement that does the compilation in userspace and avoid the iptables->libiptc->setsockopt->userspace roundtrip and the associated changes to the kernel module loader you introduced.

kubernetes is using iptables-restore, which is part of iptables and again has the same syntax. However, it avoids the per-rule fork+exec overhead, which is why the netfilter project has been recommending it to be used in such situations.

Do you have a list of known projects that use the legacy sockopt-based iptables uapi directly, without using code from the iptables.git codebase (e.g. libiptc, iptables or iptables-restore)? IMHO only those projects would benefit from the approach you have taken vs. an approach that simply offers a compatible commandline syntax.

> Usually they have their requests to iptables baked into their code directly
> which probably won't change any time soon, so thought was that they could
> benefit initially from it once there would be sufficient coverage.

If the binary offers the same syntax (it could even be a fork/version of the iptables codebase, only using the parsing without the existing backend generating the rules), the same goal could be achieved.

The above of course assumes that you have a 100% functional replacement (for 100% of the features that your use cases use) underneath the "iptables command syntax" compatibility. But you need that in both cases, whether you use the existing userspace api or not.

Regards,
	Harald
-- 
- Harald Welte   http://laforge.gnumonks.org/
"Privacy in residential applications is a desirable marketing option."
(ETSI EN 300 175-7 Ch. A6)
Re: [PATCH RFC 0/4] net: add bpfilter
Hi David,

On Fri, Feb 16, 2018 at 05:33:54PM -0500, David Miller wrote:
> From: Florian Westphal
>
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> > announced by kernel and act on those)?
>
> As Daniel said, iptables is by far the most deployed of the two
> technologies. Therefore it provides the largest environment for
> testing and coverage.

As I outlined earlier, this way you are perpetuating the architectural mistakes and constraints that were created ~ 18 years ago without any benefit from the lessons learned ever since. In netfilter, we already wanted to replace it as early as 2006 (AFAIR) with nfnetlink based pkttables (which never materialized).

I would strongly suggest to focus on nftables (or even some other way of configuration / userspace interaction) to ensure that the iptables userspace interface can at some point be phased out. Like we did with ipchains before, and before that with ipfwadm.

By making a new implementation dependent on the oldest interface you are perpetuating it.

Sure, one can go that way, but I would suggest this to be a *very* carefully weighed decision after a detailed analysis/discussion.

-- 
- Harald Welte   http://laforge.gnumonks.org/
"Privacy in residential applications is a desirable marketing option."
(ETSI EN 300 175-7 Ch. A6)
Re: [PATCH RFC 0/4] net: add bpfilter
From: Florian Westphal
Date: Fri, 16 Feb 2018 17:14:08 +0100

> Any particular reason why translating iptables rather than nftables
> (it should be possible to monitor the nftables changes that are
> announced by kernel and act on those)?

As Daniel said, iptables is by far the most deployed of the two technologies. Therefore it provides the largest environment for testing and coverage.
Re: [PATCH RFC 0/4] net: add bpfilter
From: Florian Westphal
Date: Fri, 16 Feb 2018 15:57:27 +0100

> 4. Do you plan to reimplement connection tracking in userspace?
> If no, how will the bpf program interact with it?

The natural way to handle this, as with anything BPF related, is with appropriate BPF helpers which would be added for this purpose.
Re: [PATCH RFC 0/4] net: add bpfilter
Hi Florian,

On 02/16/2018 05:14 PM, Florian Westphal wrote:
> Florian Westphal wrote:
>> Daniel Borkmann wrote:
>> Several questions spinning at the moment, I will probably come up with
>> more:
>
> ... and here there are some more ...
>
> One of the many pain points of xtables design is the assumption of 'used
> only by sysadmin'.
>
> This has not been true for a very long time, so by now iptables has
> this userspace lock (yes, its fugly workaround) to serialize concurrent
> iptables invocations in userspace.
>
> AFAIU the translate-in-userspace design now brings back the old problem
> of different tools overwriting each others iptables rules.

Right, so the behavior would need to be adapted to be exactly the same, given all the requests go into kernel space first via the usual uapis, I don't think there would be anything in the way of keeping that as is.

> Another question -- am i correct in that each rule manipulation would
> incur a 'recompilation'? Or are there different mini programs chained
> together?

Right now in the PoC yes, basically it regenerates the program on the fly in gen.c when walking the struct bpfilter_ipt_ip's and appends the entries to the program, but it doesn't have to be that way. There are multiple options to allow for a partial code generation, e.g. via chaining tail call arrays or directly via BPF to BPF calls eventually. A few changes would be needed on the BPF side, but it can be done. Additionally, various optimization passes could be performed during the code generation phase, while keeping given constraints, in order to speed up getting to a verdict.

> One of the nftables advantages is that (since rule representation in
> kernel is black-box from userspace point of view) is that the kernel
> can announce add/delete of rules or elements from nftables sets.
>
> Any particular reason why translating iptables rather than nftables
> (it should be possible to monitor the nftables changes that are
> announced by kernel and act on those)?

Yeah, correct, this should be possible as well. We started out with the iptables part in the demo as the majority of bigger infrastructure projects all still rely heavily on it (e.g. docker, k8s to just name two big ones). Usually they have their requests to iptables baked into their code directly which probably won't change any time soon, so thought was that they could benefit initially from it once there would be sufficient coverage.

Thanks,
Daniel
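
The difference between regenerating the whole program on every rule change and chaining separately generated pieces can be sketched as follows. This is plain Python rather than BPF, and it only loosely models tail calls or BPF to BPF calls; `compile_rule`, `MonolithicFilter` and `ChainedFilter` are hypothetical names, and the "compilation" counter simply counts how many rules had to be translated.

```python
def compile_rule(rule):
    """Turn one hypothetical rule dict into a small match function."""
    src, dst, verdict = rule.get("src"), rule.get("dst"), rule["verdict"]

    def match(pkt):
        if src is not None and pkt["src"] != src:
            return None
        if dst is not None and pkt["dst"] != dst:
            return None
        return verdict

    return match


def run(program, pkt, policy="accept"):
    """Evaluate the compiled pieces in order; fall back to the policy."""
    for match in program:
        verdict = match(pkt)
        if verdict is not None:
            return verdict
    return policy


class MonolithicFilter:
    """Regenerates the complete program whenever a rule is added."""

    def __init__(self):
        self.rules, self.program, self.compilations = [], [], 0

    def add_rule(self, rule):
        self.rules.append(rule)
        self.program = [compile_rule(r) for r in self.rules]
        self.compilations += len(self.rules)


class ChainedFilter:
    """Compiles only the new rule and chains it onto the existing pieces."""

    def __init__(self):
        self.program, self.compilations = [], 0

    def add_rule(self, rule):
        self.program.append(compile_rule(rule))
        self.compilations += 1
```

Adding n rules one by one translates n(n+1)/2 rules in the monolithic variant but only n in the chained one, while both produce the same verdicts. What the sketch leaves out is the trade-off mentioned above: a whole-program view leaves more room for optimization passes.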
Re: [PATCH RFC 0/4] net: add bpfilter
Hi Florian,

thanks for your feedback! More inline:

On 02/16/2018 03:57 PM, Florian Westphal wrote:
> Daniel Borkmann wrote:
>> This is a very rough and early proof of concept that implements bpfilter.
>
> [..]
>
>> Also, as a benefit from such design, we get BPF JIT compilation on x86_64,
>> arm64, ppc64, sparc64, mips64, s390x and arm32, but also rule offloading
>> into HW for free for Netronome NFP SmartNICs that are already capable of
>> offloading BPF since we can reuse all existing BPF infrastructure as the
>> back end. The user space iptables binary issuing rule addition or dumps was
>> left as-is, thus at some point any binaries against iptables uapi kernel
>> interface could transparently be supported in such manner in long term.
>>
>> As rule translation can potentially become very complex, this is performed
>> entirely in user space. In order to ease deployment, request_module() code
>> is extended to allow user mode helpers to be invoked. Idea is that user mode
>> helpers are built as part of the kernel build and installed as traditional
>> kernel modules with .ko file extension into distro specified location,
>> such that from a distribution point of view, they are no different than
>> regular kernel modules. Thus, allow request_module() logic to load such
>> user mode helper (umh) binaries via:
>>
>> request_module("foo") ->
>> call_umh("modprobe foo") ->
>> sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>> call_umh(struct file)
>>
>> Such approach enables kernel to delegate functionality traditionally done
>> by kernel modules into user space processes (either root or !root) and
>> reduces security attack surface of such new code, meaning in case of
>> potential bugs only the umh would crash but not the kernel. Another
>> advantage coming with that would be that bpfilter.ko can be debugged and
>> tested out of user space as well (e.g. opening the possibility to run
>> all clang sanitizers, fuzzers or test suites for checking translation).
>
> Several questions spinning at the moment, I will probably come up with
> more:

Sure, no problem at all. It's an early RFC, so purpose is to get a discussion going on such potential approach.

> 1. Does this still attach the binary blob to the 'normal' iptables
>    hooks?

Yeah, so thought would be to keep the user land tooling functional as is w/o having to recompile binaries, thus this would also need to attach for the existing hooks in order to keep semantics working. As a benefit in addition we can also reuse all the rest of the infrastructure to utilize things like XDP for iptables in the background, there is definitely flexibility on this side thus users could eventually benefit from this transparently and don't need to know that 'bpfilter' exists and is translating in the background. I realize taking this path is a long-term undertaking that we would need to tackle as a community, not just one or two individuals, when we decide to go in this direction.

> 2. If yes, do you see issues wrt. 'iptables' and 'bpfilter' attached
> programs being different in nature (e.g. changed by different entities)?

There could certainly be multiple options, e.g. a fall-through with state transfer once a request cannot be handled yet or a sysctl with iptables being the default handler and an option to switch to bpfilter for letting it handle requests for that time being.

> 3. What happens if the rule can't be translated (yet?)

(See above.)

> 4. Do you plan to reimplement connection tracking in userspace?

One option could be to have a generic, skb-less connection tracker in kernel that can be reused from the various hooks it would need to handle, potentially that would also be able to get offloaded into HW as another benefit coming out from that.

> If no, how will the bpf program interact with it?
> [ same question applies to ipv6 exthdr traversal, ip defragmentation
> and the like ].

The v6 exthdr traversal could be realized natively via BPF, which should at the same time make the parsing more robust than having it somewhere inside a helper in the kernel directly; bounded loops in BPF would help as well on that front. Similarly for defrag, this could be handled by the prog, although here we would need additional infra to queue the packets and then recirculate.

> I will probably have a quadrillion of followup questions, sorry :-/

Definitely, please do!

Thanks,
Daniel

>> Also, such architecture makes the kernel/user boundary very precise,
>> meaning requests can be handled and BPF translated in control plane part
>> in user space with its own user memory etc, while minimal data plane
>> bits are in kernel. It would also allow to remove old xtables modules
>> at some point from the kernel while keeping functionality in place.
>
> This is what we tried with nftables :-/
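
To illustrate what an extension header walk with a bounded loop looks like, here is a small Python sketch (not BPF, and not taken from the bpfilter PoC). It follows the generic IPv6 extension header layout, where the first byte is the next-header value and the second byte is the length in 8-octet units beyond the first 8 octets; the fragment header is a fixed 8 octets. The bound `MAX_EXT_HEADERS` and the function name are assumptions made for the example, and headers such as AH, as well as non-first fragments, are deliberately not handled.

```python
# IPv6 next-header values (IANA protocol numbers).
HOP_BY_HOP, ROUTING, FRAGMENT, DEST_OPTS = 0, 43, 44, 60
EXT_HEADERS = (HOP_BY_HOP, ROUTING, FRAGMENT, DEST_OPTS)
MAX_EXT_HEADERS = 8  # arbitrary bound chosen for this sketch


def find_upper_layer(next_header, payload):
    """Return (protocol, offset) of the upper-layer header, or None.

    next_header is the value from the fixed IPv6 header and payload is
    everything after that fixed header, as bytes.
    """
    offset = 0
    for _ in range(MAX_EXT_HEADERS):  # fixed iteration count, no open loop
        if next_header not in EXT_HEADERS:
            return next_header, offset
        if offset + 8 > len(payload):
            return None  # truncated extension header
        if next_header == FRAGMENT:
            length = 8
        else:
            length = (payload[offset + 1] + 1) * 8
        if offset + length > len(payload):
            return None  # declared length runs past the packet
        next_header = payload[offset]
        offset += length
    return None  # chain not resolved within the bound
```

The fixed iteration count is the point of the exercise: a verifier can only accept such a walk if it can prove termination, so a packet with an excessively long header chain is rejected instead of looped over.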
Re: [PATCH RFC 0/4] net: add bpfilter
Florian Westphal wrote:
> Daniel Borkmann wrote:
> Several questions spinning at the moment, I will probably come up with
> more:

... and here there are some more ...

One of the many pain points of xtables design is the assumption of 'used only by sysadmin'.

This has not been true for a very long time, so by now iptables has this userspace lock (yes, its fugly workaround) to serialize concurrent iptables invocations in userspace.

AFAIU the translate-in-userspace design now brings back the old problem of different tools overwriting each others iptables rules.

Another question -- am i correct in that each rule manipulation would incur a 'recompilation'? Or are there different mini programs chained together?

One of the nftables advantages is that (since rule representation in kernel is black-box from userspace point of view) is that the kernel can announce add/delete of rules or elements from nftables sets.

Any particular reason why translating iptables rather than nftables (it should be possible to monitor the nftables changes that are announced by kernel and act on those)?
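
A userspace lock of the kind mentioned above can be sketched with an advisory file lock. The snippet below is an illustration only and not the iptables implementation: the lock path, `with_ruleset_lock` and `add_rule` are hypothetical, and the in-memory list stands in for the kernel ruleset. It relies on `fcntl.flock`, so it is limited to Unix-like systems.

```python
import fcntl
import os
import tempfile

# Hypothetical lock path used only by this sketch.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "ruleset-demo.lock")

ruleset = []  # stand-in for the ruleset that several tools modify


def with_ruleset_lock(update):
    """Run update() while holding an exclusive advisory lock."""
    with open(LOCK_PATH, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return update()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def add_rule(rule):
    """Read-modify-write of the ruleset, serialized by the lock."""
    def update():
        ruleset.append(rule)
        return len(ruleset)
    return with_ruleset_lock(update)
```

Such a lock only serializes cooperating processes that take it. It does not tell one tool that another tool changed or removed its rules, which is the gap that kernel-side change announcements are meant to close.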
Re: [PATCH RFC 0/4] net: add bpfilter
Daniel Borkmann wrote:
> This is a very rough and early proof of concept that implements bpfilter.

[..]

> Also, as a benefit from such design, we get BPF JIT compilation on x86_64,
> arm64, ppc64, sparc64, mips64, s390x and arm32, but also rule offloading
> into HW for free for Netronome NFP SmartNICs that are already capable of
> offloading BPF since we can reuse all existing BPF infrastructure as the
> back end. The user space iptables binary issuing rule addition or dumps was
> left as-is, thus at some point any binaries against iptables uapi kernel
> interface could transparently be supported in such manner in long term.
>
> As rule translation can potentially become very complex, this is performed
> entirely in user space. In order to ease deployment, request_module() code
> is extended to allow user mode helpers to be invoked. Idea is that user mode
> helpers are built as part of the kernel build and installed as traditional
> kernel modules with .ko file extension into distro specified location,
> such that from a distribution point of view, they are no different than
> regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) binaries via:
>
>   request_module("foo") ->
>     call_umh("modprobe foo") ->
>       sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>         call_umh(struct file)
>
> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack surface of such new code, meaning in case of
> potential bugs only the umh would crash but not the kernel. Another
> advantage coming with that would be that bpfilter.ko can be debugged and
> tested out of user space as well (e.g. opening the possibility to run
> all clang sanitizers, fuzzers or test suites for checking translation).

Several questions spinning at the moment, I will probably come up with
more:

1. Does this still attach the binary blob to the 'normal' iptables
   hooks?
2. If yes, do you see issues wrt. 'iptables' and 'bpfilter' attached
   programs being different in nature (e.g. changed by different entities)?
3. What happens if the rule can't be translated (yet?)
4. Do you plan to reimplement connection tracking in userspace?
   If no, how will the bpf program interact with it?
   [ same question applies to ipv6 exthdr traversal, ip defragmentation
   and the like ].

I will probably have a quadrillion of followup questions, sorry :-/

> Also, such architecture makes the kernel/user boundary very precise,
> meaning requests can be handled and BPF translated in control plane part
> in user space with its own user memory etc, while minimal data plane
> bits are in kernel. It would also allow to remove old xtables modules
> at some point from the kernel while keeping functionality in place.

This is what we tried with nftables :-/
[PATCH RFC 0/4] net: add bpfilter
This is a very rough and early proof of concept that implements bpfilter.
The basic idea of bpfilter is that it can process iptables queries and
translate them in user space into BPF programs which can then get attached
at various locations. For simplicity, in this RFC we demo attaching them
to XDP layer, but any other location would work as well (e.g. at the tc
sch_clsact ingress/egress location or any other/new hook with equivalent
semantics).

Also, as a benefit from such design, we get BPF JIT compilation on x86_64,
arm64, ppc64, sparc64, mips64, s390x and arm32, but also rule offloading
into HW for free for Netronome NFP SmartNICs that are already capable of
offloading BPF since we can reuse all existing BPF infrastructure as the
back end. The user space iptables binary issuing rule addition or dumps was
left as-is, thus at some point any binaries against iptables uapi kernel
interface could transparently be supported in such manner in long term.

As rule translation can potentially become very complex, this is performed
entirely in user space. In order to ease deployment, request_module() code
is extended to allow user mode helpers to be invoked. Idea is that user mode
helpers are built as part of the kernel build and installed as traditional
kernel modules with .ko file extension into distro specified location,
such that from a distribution point of view, they are no different than
regular kernel modules. Thus, allow request_module() logic to load such
user mode helper (umh) binaries via:

  request_module("foo") ->
    call_umh("modprobe foo") ->
      sys_finit_module(FD of /lib/modules/.../foo.ko) ->
        call_umh(struct file)

Such approach enables kernel to delegate functionality traditionally done
by kernel modules into user space processes (either root or !root) and
reduces security attack surface of such new code, meaning in case of
potential bugs only the umh would crash but not the kernel.
Another advantage coming with that would be that bpfilter.ko can be
debugged and tested out of user space as well (e.g. opening the
possibility to run all clang sanitizers, fuzzers or test suites for
checking translation).

Also, such architecture makes the kernel/user boundary very precise,
meaning requests can be handled and BPF translated in control plane part
in user space with its own user memory etc, while minimal data plane
bits are in kernel. It would also allow to remove old xtables modules
at some point from the kernel while keeping functionality in place.

In the implemented proof of concept we show that simple /32 src/dst IPs
are translated in such manner. More complex rules would be added later
as well, also different BPF code generation backends that can be selected
for the various attachment points, proper encoder/decoder for the uapi
requests, etc. This just starts out very simple and basic for the sake
of an early RFC to demo the idea.

In the below example, we show that dumping, loading and offloading of
one or multiple simple rules work, we show the bpftool XDP dump of the
generated BPF instruction sequence as well as a simple functional ping
test to enforce policy in such way.

Set rebased on top of 255442c93843 ("Merge tag 'docs-4.16' of [...]").

Feedback very welcome!

Various bpfilter usage examples from the PoC code:

1) Dumping current rules:

  # iptables -t filter -L
  Chain INPUT (policy ACCEPT)
  target     prot opt source               destination

  Chain FORWARD (policy ACCEPT)
  target     prot opt source               destination

  Chain OUTPUT (policy ACCEPT)
  target     prot opt source               destination

2) ping test:

  # ping -c 1 127.0.0.1 -I 127.0.0.2
  PING 127.0.0.1 (127.0.0.1) from 127.0.0.2 : 56(84) bytes of data.
  64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.040 ms

  --- 127.0.0.1 ping statistics ---
  1 packets transmitted, 1 received, 0% packet loss, time 0ms
  rtt min/avg/max/mdev = 0.040/0.040/0.040/0.000 ms

3) Adding & dumping a simple rule:

  # iptables -t filter -A INPUT -i lo -s 127.0.0.2/32 -d 127.0.0.1/32 -j DROP
  # iptables -t filter -L
  Chain INPUT (policy ACCEPT)
  target     prot opt source               destination
  DROP       all  --  127.0.0.2            localhost

  Chain FORWARD (policy ACCEPT)
  target     prot opt source               destination

  Chain OUTPUT (policy ACCEPT)
  target     prot opt source               destination

4) Dump BPF generated code for that rule (on lo it's XDP generic,
   otherwise native XDP for XDP supported drivers):

  # bpftool p
  18: xdp  tag 6b07f663830d5b0c
          loaded_at Feb 14/01:15  uid 0
          xlated 208B  not jited  memlock 4096B

  # bpftool p d x i 18
     0: (bf) r9 = r1
     1: (79) r2 = *(u64 *)(r9 +0)
     2: (79) r3 = *(u64 *)(r9 +8)
     3: (bf) r1 = r2
     4: (07) r1 += 14
     5: (bd) if r1 <= r3 goto pc+2
     6: (b4) (u32) r0 = (u32) 2
     7: (95) exit
     8: (bf) r1 = r2
     9: (b4) (u32) r5 = (u32) 0
    10: (69) r4 = *(u16 *)(r1 +12)
    11: (55) if r4 != 0x8 go