
Re: nft/bpf interpreters and spectre2. Was: [PATCH RFC 0/4] net: add bpfilter

2018-02-22 Thread Jann Horn

[resend as plaintext, apparently mobile gmail will send HTML mails]

On Thu, Feb 22, 2018 at 3:20 AM, Alexei Starovoitov wrote:
> On Wed, Feb 21, 2018 at 01:13:03PM +0100, Florian Westphal wrote:
>>
>> Obvious candidates are: meta, numgen, limit, objref, quota, reject.
>>
>> We should probably also consider removing
>> CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
>> build both too (at least rbtree since that offers interval).
>>
>> For the indirect call issue we can use direct calls from eval loop for
>> some of the more frequently used ones, similar to what we do already
>> for nft_cmp_fast_expr.
>
> nft_cmp_fast_expr and other expressions mentioned above made me thinking...
>
> do we have the same issue with nft interpreter as we had with bpf one?
> bpf interpreter was used as part of spectre2 attack to leak
> information via cache side channel and let VM read hypervisor memory.
> Due to that issue we removed bpf interpreter from the kernel code.
> That's what CONFIG_BPF_JIT_ALWAYS_ON for...
> but we still have nft interpreter in the kernel that can also
> execute arbitrary nft expressions.
>
> Jann's exploit used the following bpf instructions:
[...]
>
> and a gadget to jump into __bpf_prog_run with insn pointing
> to memory controlled by the guest while accessible
> (at different virt address) by the hypervisor.
>
> It seems possible to construct similar sequence of instructions
> out of nft expressions and use gadget that jumps into nft_do_chain().
[...]
> Obviously such exploit is harder to do than bpf based one.
> Do we need to do anything about it ?
> May be it's easier to find gadgets in .text of vmlinux
> instead of messing with interpreters?
>
> Jann,
> can you comment on removing interpreters in general?
> Do we need to worry about having bpf and/or nft interpreter
> in the kernel?

I think that for Spectre V2, the presence of interpreters isn't a big
problem. It simplifies writing attacks a bit, but I don't expect it to
be necessary if an attacker invests some time into finding useful
gadgets.

Re: nft/bpf interpreters and spectre2. Was: [PATCH RFC 0/4] net: add bpfilter

2018-02-22 Thread Alexei Starovoitov

On Thu, Feb 22, 2018 at 12:39:15PM +0100, Pablo Neira Ayuso wrote:
> Hi Alexei,
> 
> On Wed, Feb 21, 2018 at 06:20:37PM -0800, Alexei Starovoitov wrote:
> > On Wed, Feb 21, 2018 at 01:13:03PM +0100, Florian Westphal wrote:
> > > 
> > > Obvious candidates are: meta, numgen, limit, objref, quota, reject.
> > > 
> > > We should probably also consider removing
> > > CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
> > > build both too (at least rbtree since that offers interval).
> > > 
> > > For the indirect call issue we can use direct calls from eval loop for
> > > some of the more frequently used ones, similar to what we do already
> > > for nft_cmp_fast_expr. 
> > 
> > nft_cmp_fast_expr and other expressions mentioned above made me thinking...
> > 
> > do we have the same issue with nft interpreter as we had with bpf one?
> > bpf interpreter was used as part of spectre2 attack to leak
> > information via cache side channel and let VM read hypervisor memory.
> > Due to that issue we removed bpf interpreter from the kernel code.
> > That's what CONFIG_BPF_JIT_ALWAYS_ON for...
> > but we still have nft interpreter in the kernel that can also
> > execute arbitrary nft expressions.
> > 
> > Jann's exploit used the following bpf instructions:
> > struct bpf_insn evil_bytecode_instrs[] = {
> > // rax = target_byte_addr
> > { .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 0, .imm = target_byte_addr }, { .imm = target_byte_addr>>32 },
> 
> We don't place pointers in the nft VM registers, it's basically
> illegal to do so, otherwise we would need more sophisticated verifier.
> I'm telling this because we don't have a way to point to any arbitrary
> address as in 'target_byte_addr' above.

these evil_bytecode_instrs never saw bpf verifier either.
That's the scary part of that poc.
The only requirement for poc to work is to have interpreter
in executable part of hypervisor code and speculatively jump into it
with arguments pointing to memory controlled by vm.
All static checks (done by bpf verifier and by nft validation) are bypassed.
The only way to defend from such exploit is either remove the interpreter
from the kernel or add _run-time_ checks and masks for every memory access
(similar to what is done for spectre1 mitigations).
In case of bpf it's impractical.
In case of nft I suspect so too. I don't yet see how nft can check
that skb pointer passed as part of nft_pktinfo is not an actual skb.
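A minimal sketch of the kind of run-time mask meant here, in the style of the spectre1 mitigations. The helper names are hypothetical; the mask expression follows the generic branch-free pattern and assumes size <= LONG_MAX and an arithmetic right shift on negative values (true for gcc and clang, implementation-defined in ISO C).

```c
#include <assert.h>
#include <limits.h>

/* Number of bits in an unsigned long on this platform. */
#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Branch-free mask: all ones when index < size, zero otherwise.
 * Assumes size <= LONG_MAX and an arithmetic >> on negative longs.
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (BITS_PER_ULONG - 1));
}

/* Force an out-of-bounds index to 0 without a conditional branch. */
static unsigned long clamp_index_nospec(unsigned long index, unsigned long size)
{
	return index & index_mask_nospec(index, size);
}

/*
 * Hypothetical guarded load: an out-of-bounds index reads entry 0
 * instead of attacker-chosen memory, even under misspeculation.
 */
static unsigned char load_byte_masked(const unsigned char *table,
				      unsigned long table_len,
				      unsigned long index)
{
	assert(table_len > 0);
	return table[clamp_index_nospec(index, table_len)];
}
```

The cost is one extra AND per access, and this only covers array indexing. Doing the same for every pointer dereference inside an interpreter is the part called impractical above.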

> > // rdi = timing_leak_array
> > { .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 1, .imm = host_timing_leak_addr }, { .imm = host_timing_leak_addr>>32 },
> > // rax = *(u8*)rax
> > { .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0 },
> > // rax = rax << ...
> > { .code = BPF_ALU64 | BPF_LSH | BPF_K, .dst_reg = 0, .imm = 10 - bit_idx },
> > // rax = rax & 0x400
> > { .code = BPF_ALU64 | BPF_AND | BPF_K, .dst_reg = 0, .imm = 0x400 },
> > // rax = rdi + rax
> > { .code = BPF_ALU64 | BPF_ADD | BPF_X, .dst_reg = 0, .src_reg = 1 },
> > // *(u8*) (rax + 0x800)
> > { .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0x800 },
> > 
> > and a gadget to jump into __bpf_prog_run with insn pointing
> > to memory controlled by the guest while accessible
> > (at different virt address) by the hypervisor.
> > 
> > It seems possible to construct similar sequence of instructions
> > out of nft expressions and use gadget that jumps into nft_do_chain().
> > The attacker would need to discover more kernel addresses:
> > nft_do_chain, nft_cmp_fast_ops, nft_payload_fast_ops, nft_bitwise_eval,
> > nft_lookup_eval, and nft_bitmap_lookup
> > to populate nft chains, rules and expressions in guest memory
> > comparing to bpf interpreter attack.
> > 
> > Then in nft_do_chain(struct nft_pktinfo *pkt, void *priv)
> > pkt needs to point to fake struct sk_buff in guest memory with
> > skb->head == target_byte_addr
> 
> We don't have a way to make this point to fake struct sk_buff.

yet. it's possible, since cpu is speculating and all such pointers
controlled by vm can be arbitrary.

> > The first nft expression can be nft_payload_fast_eval().
> > If it's properly constructed with
> > (nft_payload->base == NFT_PAYLOAD_NETWORK_HEADER, offset == 0, len == 0, dreg == 1)
> 
> We can reject len == 0. To be honest, this is not done right now, but
> we can place a patch to validate this. Given this is a specialized
> networking virtual machine, it retain semantics, so fetching zero
> length data from a skbuff makes no sense, hence, we can return EINVAL
> via netlink when adding a rule that tries to do this.

Adding static check won't help.

Re: nft/bpf interpreters and spectre2. Was: [PATCH RFC 0/4] net: add bpfilter

2018-02-22 Thread Pablo Neira Ayuso

Hi Alexei,

On Wed, Feb 21, 2018 at 06:20:37PM -0800, Alexei Starovoitov wrote:
> On Wed, Feb 21, 2018 at 01:13:03PM +0100, Florian Westphal wrote:
> > 
> > Obvious candidates are: meta, numgen, limit, objref, quota, reject.
> > 
> > We should probably also consider removing
> > CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
> > build both too (at least rbtree since that offers interval).
> > 
> > For the indirect call issue we can use direct calls from eval loop for
> > some of the more frequently used ones, similar to what we do already
> > for nft_cmp_fast_expr. 
> 
> nft_cmp_fast_expr and other expressions mentioned above made me thinking...
> 
> do we have the same issue with nft interpreter as we had with bpf one?
> bpf interpreter was used as part of spectre2 attack to leak
> information via cache side channel and let VM read hypervisor memory.
> Due to that issue we removed bpf interpreter from the kernel code.
> That's what CONFIG_BPF_JIT_ALWAYS_ON for...
> but we still have nft interpreter in the kernel that can also
> execute arbitrary nft expressions.
> 
> Jann's exploit used the following bpf instructions:
> struct bpf_insn evil_bytecode_instrs[] = {
> // rax = target_byte_addr
> { .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 0, .imm = target_byte_addr }, { .imm = target_byte_addr>>32 },

We don't place pointers in the nft VM registers, it's basically
illegal to do so, otherwise we would need more sophisticated verifier.
I'm telling this because we don't have a way to point to any arbitrary
address as in 'target_byte_addr' above.

> // rdi = timing_leak_array
> { .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 1, .imm = host_timing_leak_addr }, { .imm = host_timing_leak_addr>>32 },
> // rax = *(u8*)rax
> { .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0 },
> // rax = rax << ...
> { .code = BPF_ALU64 | BPF_LSH | BPF_K, .dst_reg = 0, .imm = 10 - bit_idx },
> // rax = rax & 0x400
> { .code = BPF_ALU64 | BPF_AND | BPF_K, .dst_reg = 0, .imm = 0x400 },
> // rax = rdi + rax
> { .code = BPF_ALU64 | BPF_ADD | BPF_X, .dst_reg = 0, .src_reg = 1 },
> // *(u8*) (rax + 0x800)
> { .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0x800 },
> 
> and a gadget to jump into __bpf_prog_run with insn pointing
> to memory controlled by the guest while accessible
> (at different virt address) by the hypervisor.
> 
> It seems possible to construct similar sequence of instructions
> out of nft expressions and use gadget that jumps into nft_do_chain().
> The attacker would need to discover more kernel addresses:
> nft_do_chain, nft_cmp_fast_ops, nft_payload_fast_ops, nft_bitwise_eval,
> nft_lookup_eval, and nft_bitmap_lookup
> to populate nft chains, rules and expressions in guest memory
> comparing to bpf interpreter attack.
> 
> Then in nft_do_chain(struct nft_pktinfo *pkt, void *priv)
> pkt needs to point to fake struct sk_buff in guest memory with
> skb->head == target_byte_addr

We don't have a way to make this point to fake struct sk_buff.

> The first nft expression can be nft_payload_fast_eval().
> If it's properly constructed with
> (nft_payload->base == NFT_PAYLOAD_NETWORK_HEADER, offset == 0, len == 0, dreg == 1)

We can reject len == 0. To be honest, this is not done right now, but
we can add a patch to validate this. Given this is a specialized
networking virtual machine, it retains semantics, so fetching zero
length data from a skbuff makes no sense; hence, we can return EINVAL
via netlink when adding a rule that tries to do this.

> it will do arbitrary load of
> *(u8 *)dest = *(u8 *)ptr;
> from target_byte_addr into register 1 of nft state machine
> (dest is u32 array of registers in the stack of nft_do_chain)
> Second nft expression can be nft_bitwise_eval() to mask particular
> bit in register 1.
> Then nft_cmp_eval() to check whether bit is one or zero and
> conditional NFT_BREAK out of first nft expression into second nft rule.
> The last conditional nft_immediate_eval() in the first rule will set
> register 1 to 0x400 * 8 while the first nft_bitwise_eval() in
> the second rule with do r1 &= 0x400 * 8.
> So at this point r1 will have either 0x400 * 8 or 0 depending
> on value of speculatively loaded bit.
> The last expression can be nft_lookup_eval() with 
> nft_lookup->set->ops->lookup == nft_bitmap_lookup
> which will do nft_bitmap->bitmap[idx] where idx = r1 / 8
> The memory used for this last nft_lookup/bitmap expression is
> both an instruction and timing_leak_array itself.
> If I'm not mistaken, this sequence of nft expression will
> speculatively execute very similar logic as in evil_bytecode_instrs[]

My impression is that several assumptions above are not correct.

> The amount of actual speculative native cpu load/stores/branches is
> probably more than executed by bpf interpreter for these evil bytecodes,
> but likely well within cpu speculation window of 100+ insns.
> 
> Obviously such exploit is harder to do than bpf

nft/bpf interpreters and spectre2. Was: [PATCH RFC 0/4] net: add bpfilter

2018-02-21 Thread Alexei Starovoitov

On Wed, Feb 21, 2018 at 01:13:03PM +0100, Florian Westphal wrote:
> 
> Obvious candidates are: meta, numgen, limit, objref, quota, reject.
> 
> We should probably also consider removing
> CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
> build both too (at least rbtree since that offers interval).
> 
> For the indirect call issue we can use direct calls from eval loop for
> some of the more frequently used ones, similar to what we do already
> for nft_cmp_fast_expr. 

nft_cmp_fast_expr and the other expressions mentioned above made me think...

do we have the same issue with the nft interpreter as we had with the bpf one?
The bpf interpreter was used as part of a spectre2 attack to leak
information via a cache side channel and let a VM read hypervisor memory.
Due to that issue we removed the bpf interpreter from the kernel code.
That's what CONFIG_BPF_JIT_ALWAYS_ON is for...
but we still have the nft interpreter in the kernel, which can also
execute arbitrary nft expressions.

Jann's exploit used the following bpf instructions:
struct bpf_insn evil_bytecode_instrs[] = {
	// rax = target_byte_addr
	{ .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 0, .imm = target_byte_addr },
	{ .imm = target_byte_addr>>32 },
	// rdi = timing_leak_array
	{ .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 1, .imm = host_timing_leak_addr },
	{ .imm = host_timing_leak_addr>>32 },
	// rax = *(u8*)rax
	{ .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0 },
	// rax = rax << ...
	{ .code = BPF_ALU64 | BPF_LSH | BPF_K, .dst_reg = 0, .imm = 10 - bit_idx },
	// rax = rax & 0x400
	{ .code = BPF_ALU64 | BPF_AND | BPF_K, .dst_reg = 0, .imm = 0x400 },
	// rax = rdi + rax
	{ .code = BPF_ALU64 | BPF_ADD | BPF_X, .dst_reg = 0, .src_reg = 1 },
	// *(u8*) (rax + 0x800)
	{ .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0x800 },
};

and a gadget to jump into __bpf_prog_run with insn pointing
to memory controlled by the guest while accessible
(at different virt address) by the hypervisor.
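The arithmetic those instructions perform can be restated in plain C. This is only a user-space restatement of the shift/mask/add steps in the listing, not interpreter or kernel code, and the function name leak_offset is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Offset into the timing array touched by the final load, given the
 * byte that was read and the index of the bit being probed.
 */
static uint64_t leak_offset(uint8_t byte, unsigned int bit_idx)
{
	assert(bit_idx < 8);

	uint64_t rax = byte;
	rax <<= 10 - bit_idx;	/* move the probed bit to bit position 10 */
	rax &= 0x400;		/* keep only that bit: 0 or 0x400 */
	return rax + 0x800;	/* .off = 0x800 in the last BPF_LDX */
}
```

The result is 0x800 when the probed bit is zero and 0xc00 when it is one, so the two outcomes touch locations 0x400 bytes apart and can be told apart by timing.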

It seems possible to construct a similar sequence of instructions
out of nft expressions and use a gadget that jumps into nft_do_chain().
Compared to the bpf interpreter attack, the attacker would need to
discover more kernel addresses:
nft_do_chain, nft_cmp_fast_ops, nft_payload_fast_ops, nft_bitwise_eval,
nft_lookup_eval, and nft_bitmap_lookup
to populate nft chains, rules and expressions in guest memory.

Then in nft_do_chain(struct nft_pktinfo *pkt, void *priv)
pkt needs to point to a fake struct sk_buff in guest memory with
skb->head == target_byte_addr.
The first nft expression can be nft_payload_fast_eval().
If it's properly constructed with
(nft_payload->base == NFT_PAYLOAD_NETWORK_HEADER, offset == 0, len == 0, dreg == 1)
it will do an arbitrary load of
*(u8 *)dest = *(u8 *)ptr;
from target_byte_addr into register 1 of the nft state machine
(dest is the u32 array of registers on the stack of nft_do_chain).
The second nft expression can be nft_bitwise_eval() to mask a particular
bit in register 1.
Then nft_cmp_eval() to check whether the bit is one or zero and a
conditional NFT_BREAK out of the first nft expression into the second nft rule.
The last conditional nft_immediate_eval() in the first rule will set
register 1 to 0x400 * 8 while the first nft_bitwise_eval() in
the second rule will do r1 &= 0x400 * 8.
So at this point r1 will have either 0x400 * 8 or 0 depending
on the value of the speculatively loaded bit.
The last expression can be nft_lookup_eval() with
nft_lookup->set->ops->lookup == nft_bitmap_lookup
which will do nft_bitmap->bitmap[idx] where idx = r1 / 8.
The memory used for this last nft_lookup/bitmap expression is
both an instruction and the timing_leak_array itself.
If I'm not mistaken, this sequence of nft expressions will
speculatively execute very similar logic to that in evil_bytecode_instrs[].

The amount of actual speculative native cpu load/stores/branches is
probably more than executed by bpf interpreter for these evil bytecodes,
but likely well within cpu speculation window of 100+ insns.

Obviously such an exploit is harder to do than the bpf based one.
Do we need to do anything about it?
Maybe it's easier to find gadgets in .text of vmlinux
instead of messing with interpreters?

Jann,
can you comment on removing interpreters in general?
Do we need to worry about having bpf and/or nft interpreter
in the kernel?

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-21 Thread Florian Westphal

Pablo Neira Ayuso  wrote:
> On Tue, Feb 20, 2018 at 05:52:54PM -0800, Alexei Starovoitov wrote:
> > On Tue, Feb 20, 2018 at 11:44:31AM +0100, Pablo Neira Ayuso wrote:
> > > 
> > >   Don't get me wrong, no software is safe from security issues, but if you
> > >   don't abstract your resources in the right way, you have more chance to
> > >   experience more problems.
> > 
> > interesting point.
> > The key part of iptables and nft design is heavy use of indirect calls.
> > The execution of single iptable rule is ~3 indirect calls.
> > Quite a lot worse in nft where every expression is an indirect call.
> 
> That's right. Netfilter is probably too modular, probably we can
> revisit this to find a better balance, actually Felix Fietkau was
> recently rising concerns on this, specifically in environments with
> limited space to store the kernel image. We'll have a look, thanks for
> remind us about this.

Agree, we have too many config knobs, probably a good idea to turn some
modules into plain .o (like cmp and bitwise).

Obvious candidates are: meta, numgen, limit, objref, quota, reject.

We should probably also consider removing
CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
build both too (at least rbtree since that offers interval).

For the indirect call issue we can use direct calls from eval loop for
some of the more frequently used ones, similar to what we do already
for nft_cmp_fast_expr.  But maybe we don't even have to if we can
get help to build a jitter that takes an nftables netlink table dump
and builds jit code from that.
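A rough sketch of the "direct calls from eval loop" idea: compare the callback pointer against a few hot evaluators and call those directly, falling back to the indirect call for everything else. The struct and function names below are hypothetical stand-ins, not the nf_tables types.

```c
#include <assert.h>
#include <stdint.h>

struct expr;
typedef int (*eval_fn)(const struct expr *e, uint32_t *regs);

/* Hypothetical expression: a callback plus one immediate operand. */
struct expr {
	eval_fn eval;
	uint32_t imm;
};

/* Frequently used evaluators that get a direct-call fast path. */
static int cmp_fast_eval(const struct expr *e, uint32_t *regs)
{
	return regs[0] == e->imm;	/* 0 means "rule does not match" */
}

static int bitwise_eval(const struct expr *e, uint32_t *regs)
{
	regs[0] &= e->imm;
	return 1;
}

/* A rarely used evaluator that still goes through the pointer. */
static int add_eval(const struct expr *e, uint32_t *regs)
{
	regs[0] += e->imm;
	return 1;
}

/* Evaluate a rule; returns 1 if every expression matched. */
static int run_rule(const struct expr *exprs, int n, uint32_t *regs)
{
	for (int i = 0; i < n; i++) {
		const struct expr *e = &exprs[i];
		int ok;

		if (e->eval == cmp_fast_eval)
			ok = cmp_fast_eval(e, regs);	/* direct call */
		else if (e->eval == bitwise_eval)
			ok = bitwise_eval(e, regs);	/* direct call */
		else
			ok = e->eval(e, regs);		/* indirect fallback */

		if (!ok)
			return 0;
	}
	return 1;
}
```

The pointer comparisons are ordinary conditional branches, so the hot paths avoid the retpoline thunk. The trade-off is that the list of special-cased evaluators has to be maintained by hand.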

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-21 Thread Pablo Neira Ayuso

On Tue, Feb 20, 2018 at 05:52:54PM -0800, Alexei Starovoitov wrote:
> On Tue, Feb 20, 2018 at 11:44:31AM +0100, Pablo Neira Ayuso wrote:
> > 
> >   Don't get me wrong, no software is safe from security issues, but if you
> >   don't abstract your resources in the right way, you have more chance to
> >   experience more problems.
> 
> interesting point.
> The key part of iptables and nft design is heavy use of indirect calls.
> The execution of single iptable rule is ~3 indirect calls.
> Quite a lot worse in nft where every expression is an indirect call.

That's right. Netfilter is probably too modular, and we can probably
revisit this to find a better balance. Actually, Felix Fietkau was
recently raising concerns about this, specifically in environments with
limited space to store the kernel image. We'll have a look, thanks for
reminding us about this.

[...]
> CPUs will eventually be fixed and IBRS_ALL will become reality.
> Until then the kernel has to deal with the performance issues.

Hopefully, so we can all skip these problems.

Thanks.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-20 Thread Alexei Starovoitov

On Tue, Feb 20, 2018 at 11:44:31AM +0100, Pablo Neira Ayuso wrote:
> 
>   Don't get me wrong, no software is safe from security issues, but if you
>   don't abstract your resources in the right way, you have more chance to
>   experience more problems.

interesting point.
The key part of iptables and nft design is heavy use of indirect calls.
The execution of a single iptables rule is ~3 indirect calls.
Quite a lot worse in nft where every expression is an indirect call.
If my math is correct even the simplest nft rule will get to ~10.
It was all fine until spectre2 was discovered and retpoline
now adds 20-30 cycles for each indirect call.
To put numbers in perspective the simple
for(...)
  indirect_call();
loop without retpoline does ~500 M iterations per second on a 2+ GHz Xeon.
clang -mretpoline
gcc -mindirect-branch=thunk
gcc -mindirect-branch=thunk-inline
produce slightly different code with performance of 80-90 M
iterations per second for the above loop.
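A minimal sketch of such a loop for anyone who wants to repeat the measurement. The helper names are hypothetical, and clock() is used only to keep the sketch portable; a real measurement would want a finer timer and many more iterations.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t (*step_fn)(uint64_t);

static uint64_t step(uint64_t x)
{
	return x + 1;
}

/*
 * The pointer itself is volatile, so the compiler must reload it on
 * every iteration and cannot turn the call into a direct one.
 */
static step_fn volatile indirect_target = step;

/* Perform iters indirect calls and return the accumulated value. */
static uint64_t indirect_loop(uint64_t iters)
{
	uint64_t acc = 0;

	for (uint64_t i = 0; i < iters; i++)
		acc = indirect_target(acc);
	return acc;
}

/* Iterations per second of CPU time, or 0.0 if the run was too short. */
static double iterations_per_second(uint64_t iters)
{
	assert(iters > 0);

	clock_t start = clock();
	uint64_t acc = indirect_loop(iters);
	clock_t end = clock();

	assert(acc == iters);	/* also keeps the loop from being discarded */

	double secs = (double)(end - start) / CLOCKS_PER_SEC;
	return secs > 0.0 ? (double)iters / secs : 0.0;
}
```

Building the same file once without and once with one of the retpoline options listed above, then comparing iterations_per_second() for a large iteration count, should show the gap. The absolute numbers depend on the CPU and compiler.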

Looks like iptables/nft did not abstract the resources in
the right way and now experiences more problems.

CPUs will eventually be fixed and IBRS_ALL will become reality.
Until then the kernel has to deal with the performance issues.

bpf and the networking stack will suffer from retpoline as well
and we need to work asap on devirtualization and other ideas.
For xdp a single indirect call we do per packet (to call into bpf prog)
is noticeable and we're experimenting with static_key-like approach to
call bpf program with direct call.
bpf_tail_calls will suffer too and cannot be accelerated as-is.
To solve that we're working on dynamic linking via verifier improvements.
C based bpf programs will use normal indirect calls, but verifier will
replace indirect with direct at pointer update time.
It's not going to be easy, but bpf and the stack are fixable,
whereas iptables/nft are going to suffer until fixed CPUs find
their way into servers years from now.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-20 Thread Phil Sutter

Hi Michal,

On Tue, Feb 20, 2018 at 10:35:41AM +0100, Michal Kubecek wrote:
> On Mon, Feb 19, 2018 at 06:09:39PM +0100, Phil Sutter wrote:
> > What puzzles me about your argumentation is that you seem to propose for
> > the kernel to cover up flaws in userspace. Spinning this concept further
> > would mean that if there would be an old bug in iproute2 we should think
> > of adding a workaround to rtnetlink interface in kernel because
> > containers will keep the old iproute2 binary? Or am I (hopefully) just
> > missing your point?
> 
> Actually, that's what we already do. This is from rtnl_dump_ifinfo():
> 
>   /* A hack to preserve kernel<->userspace interface.
>    * The correct header is ifinfomsg. It is consistent with rtnl_getlink.
>    * However, before Linux v3.9 the code here assumed rtgenmsg and that's
>    * what iproute2 < v3.9.0 used.
>    * We can detect the old iproute2. Even including the IFLA_EXT_MASK
>    * attribute, its netlink message is shorter than struct ifinfomsg.
>    */

The reason why this is in place (and should be IMHO) is that commit
88c5b5ce5cb57 ("rtnetlink: Call nlmsg_parse() with correct header
length") incompatibly changed uAPI.
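The detection that comment describes comes down to a length comparison. A minimal sketch, using local stand-in structs that mirror the uAPI layouts of struct rtgenmsg and struct ifinfomsg so that it builds outside the kernel; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct rtgenmsg: a single family byte. */
struct rtgenmsg_compat {
	unsigned char rtgen_family;
};

/* Stand-in for struct ifinfomsg. */
struct ifinfomsg_compat {
	unsigned char	ifi_family;
	unsigned char	ifi_pad;
	unsigned short	ifi_type;
	int		ifi_index;
	unsigned int	ifi_flags;
	unsigned int	ifi_change;
};

/*
 * Pick the header length to parse with: a payload shorter than the
 * correct header can only come from a sender that used the old one.
 */
static size_t dump_request_hdrlen(size_t payload_len)
{
	/* The heuristic only works because the old header is shorter. */
	assert(sizeof(struct rtgenmsg_compat) < sizeof(struct ifinfomsg_compat));

	return payload_len < sizeof(struct ifinfomsg_compat)
		? sizeof(struct rtgenmsg_compat)
		: sizeof(struct ifinfomsg_compat);
}
```

An old-style request with the padded one-byte header plus one 8-byte attribute is 12 bytes, which is still below the size of the newer header, so it is classified as old.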

I have a different example which reflects what I have in mind, namely
iproute2 commit 33f6dd23a51c4 ("ip fou: pass family attribute as u8")
which basically does:

| -   addattr16(n, 1024, FOU_ATTR_AF, family);
| +   addattr8(n, 1024, FOU_ATTR_AF, family);

If kernel cares about those userspace bugs, shouldn't the better fix be
to make it expect u16 in FOU_ATTR_AF and check whether the high or low
byte contains the expected value?
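To make the question concrete, here is a sketch of what such a tolerant parser could look like. It is purely hypothetical, not kernel code: it accepts either payload width and reads a two-byte payload as a host-order u16 (the sender and receiver share the host), which avoids guessing between high and low byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: return the address family carried in an
 * attribute payload of 1 or 2 bytes, or -1 if it cannot be a family.
 */
static int fou_family_from_attr(const void *payload, size_t len)
{
	assert(payload != NULL);

	if (len == sizeof(uint8_t)) {		/* fixed sender: u8 */
		uint8_t v;

		memcpy(&v, payload, sizeof(v));
		return v;
	}
	if (len == sizeof(uint16_t)) {		/* old sender: u16 */
		uint16_t v;

		memcpy(&v, payload, sizeof(v));
		return v <= UINT8_MAX ? (int)v : -1;
	}
	return -1;				/* unexpected width */
}
```

Whether carrying this kind of compatibility code in the kernel forever is preferable to fixing userspace is exactly the point under discussion.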

Cheers, Phil

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-20 Thread David Miller

From: Pablo Neira Ayuso 
Date: Tue, 20 Feb 2018 11:44:31 +0100

> * Lack of sufficient abstraction: bpf is not only exposing its own
>   software bugs through its interface, but it will also bite the dust
>   with CPU bugs due to lack of glue code to hide details behind the
>   syscall interface curtain.  That will need a kernel upgrade after all to
>   fix, so all benefits of adding new programs. We've even seem claims on
>   performance being more important than security in this mailing list.
>   Don't get me wrong, no software is safe from security issues, but if you
>   don't abstract your resources in the right way, you have more chance to
>   experience more problems.

I find it surprising that the person who didn't even know that
generating classical BPF was not appropriate in his patches is
suddenly a complete expert on eBPF and all of its shortcomings.

Pablo, I am sincerely very disappointed in you, and if you continue
to attack eBPF in such an ignorant way going forward we will have
a very hard time taking you seriously at all.

Thank you.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-20 Thread Daniel Borkmann

On 02/20/2018 11:44 AM, Pablo Neira Ayuso wrote:
> Hi David!
> 
> On Mon, Feb 19, 2018 at 12:22:26PM -0500, David Miller wrote:
> [...]
>> Netfilter's chronic performance differential is why a lot of mindshare
>> was lost to userspace networking technologies.
> 
> Claiming that Netfilter is the reason for the massive adoption of
> userspace networking isn't a fair statement at all.
> 
> Let's talk about performance if this is what you want:
> 
> * Our benchmarks here are delivering ~x9.5 performance boost for IPv4
>   load balancing from netfilter ingress.
> 
> * ~x2 faster than iptables prerouting when dropping packets at very
>   early stage in the network datapath - dos attack scenario - again from
>   the ingress hook.
> 
> * The new flowtable infrastructure that will show up in 4.16 provides
>   a faster forwarding path, measuring ~x2 faster forwarding here, _by
>   simply adding one single rule to your FORWARD chain_. And that's
>   just the initial implementation that got merged upstream, we have
>   room to fly even faster.
> 
> And that's just the beginning, we have more ongoing work, incrementally
> based on top of what we have, to provide even faster datapath paths with
> very simple configurations.
> 
> Note that those numbers above are very similar numbers to what we have
> seen in bpf.  Well, to be honest, we're just slightly behind bpf, since
> benchmarks I have seen on loading balancing IPv4 is x10 from XDP,
> dropping packets also slightly more than x2, which is actually happening
> way earlier than ingress, naturally dropping earlier gives us better
> numbers.
> 
> But it's not all about performance... let's have a look at the "iron
> triangle"...
> 
> We keep usability in our radar, that's paramount for us. Netfilter is
> probably so much widely more adopted than tc because of this. The kind

Right, in terms of performance the above is what tc ingress used to do
already long ago after spinlock removal could be lifted, which was an
important step on that direction. In terms of usability, sure, it's always
a 'fun' topic on that matter for a number of classifier / actions mostly
from the older days. I think there it has improved a bit over time,
but at least speaking of things like cls_bpf, it's trivial to attach an
object somewhere via tc cmdline.

> of problems that big Silicon datacenters have to face are simply
> different to the millions of devices running Linux outthere, there are
> plenty of smart devops outthere that sacrify the little performance loss
> at the cost of keeping it easy to configure and maintain things.
> 
> If we want to talk about problems...
> 
> Every project has its own subset of problems. In that sense, anyone that
> has spent time playing with the bpf infrastructure is very much aware of
> all of its usability problems:
> 
> * You have to disable optimizations in llvm, otherwise the verifier
>   gets confused too smart compiler optimizations and rejects the code.

That is actually a false claim, which makes me think that you didn't even
give this a try at all before stating the above. Funny enough, for a very
long period of time in LLVM's BPF back end when you used other optimization
levels than the -O2, clang would bark with an internal error, for example:

  $ clang-3.9 -target bpf -O0 -c foo.c -o /tmp/foo.o
  fatal error: error in backend: Cannot select: 0x5633ae698280: ch,glue = BPFISD::CALL 0x5633ae698210, 0x5633ae697e90, Register:i64 %R1, Register:i64 %R2, Register:i64 %R3, 0x5633ae698210:1
    0x5633ae697e90: i64,ch = load 0x5633ae6955e0, 0x5633ae694fc0, undef:i64
      0x5633ae694fc0: i64 = BPFISD::Wrapper TargetGlobalAddress:i64 0
  [...]

Whereas -O2 *is* the general recommendation for everyone to use:

  $ clang-3.9 -target bpf -O2 -c foo.c -o /tmp/foo.o
  $

This is fixed in later versions, e.g. in clang-7.0 such back end error is
gone anyway fwiw. But in any case, we're running complex programs with -O2
optimization levels for several years now just fine. Yes, given we do push
BPF to the limits we had some corner cases where the verifier had to be
adjusted, but overall the number of cases reduced over time, which is also
a natural progression when people use it in various advanced ways. In fact,
it's a much better choice to use clang with -O2 here since simply the majority
of people use it that way. And if you consume it via higher level front ends
e.g. bcc, ply, bpftrace to name a few from tracing side, then you don't need
to care at all about this. (But in addition to that, there's also continuous
effort on LLVM side to optimize BPF code generation in various ways.)

> * Very hard to debug the reason why the verifier is rejecting apparently
>   valid code. That results in people playing strange "moving code around
>   up and down".

Please show me your programs and I'm happy to help you out. :-) Yes, in the
earlier days, I would consider it might have been hard; during the course
of the last few years, the verifier and LLVM back end hav

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-20 Thread Phil Sutter

Hi David,

On Mon, Feb 19, 2018 at 12:15:37PM -0500, David Miller wrote:
> From: Phil Sutter 
> Date: Mon, 19 Feb 2018 18:09:39 +0100
> 
> > What puzzles me about your argumentation is that you seem to propose for
> > the kernel to cover up flaws in userspace. Spinning this concept further
> > would mean that if there would be an old bug in iproute2 we should think
> > of adding a workaround to rtnetlink interface in kernel because
> > containers will keep the old iproute2 binary? Or am I (hopefully) just
> > missing your point?
> 
> I'll answer this with a question.  I tried to remove UFO entirely from
> the kernel, did you see how that went?

:)

I didn't follow back then, but found mails about KVM live migration
breakage when moving to a kernel without UFO. But isn't that a problem
with how virtio_net optimizes things? Florian recently told me how
iptables CHECKSUM target was mainly introduced to overcome a different
problem in the same area. So all this is kernel covering up for kernel
problems. My question was about covering up for userspace bugs in
kernelspace. If you think that is preferable to fixing userspace, I
have to take that into consideration when dealing with userspace issues.

Cheers, Phil

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-20 Thread Pablo Neira Ayuso

Hi David!

On Mon, Feb 19, 2018 at 12:22:26PM -0500, David Miller wrote:
[...]
> Netfilter's chronic performance differential is why a lot of mindshare
> was lost to userspace networking technologies.

Claiming that Netfilter is the reason for the massive adoption of
userspace networking isn't a fair statement at all.

Let's talk about performance if this is what you want:

* Our benchmarks here are delivering ~x9.5 performance boost for IPv4
  load balancing from netfilter ingress.

* ~x2 faster than iptables prerouting when dropping packets at very
  early stage in the network datapath - dos attack scenario - again from
  the ingress hook.

* The new flowtable infrastructure that will show up in 4.16 provides
  a faster forwarding path, measuring ~x2 faster forwarding here, _by
  simply adding one single rule to your FORWARD chain_. And that's
  just the initial implementation that got merged upstream, we have
  room to fly even faster.

And that's just the beginning, we have more ongoing work, incrementally
based on top of what we have, to provide even faster datapaths with
very simple configurations.

Note that the numbers above are very similar to what we have
seen in bpf.  Well, to be honest, we're just slightly behind bpf, since
the benchmarks I have seen on load balancing IPv4 show x10 from XDP, and
dropping packets is also slightly more than x2, which actually happens
way earlier than ingress; naturally, dropping earlier gives better
numbers.

But it's not all about performance... let's have a look at the "iron
triangle"...

We keep usability on our radar, that's paramount for us. Netfilter is
probably so much more widely adopted than tc because of this. The kind
of problems that big Silicon datacenters have to face are simply
different from those of the millions of devices running Linux out there,
and there are plenty of smart devops out there that accept a little
performance loss in exchange for keeping things easy to configure and
maintain.

If we want to talk about problems...

Every project has its own subset of problems. In that sense, anyone that
has spent time playing with the bpf infrastructure is very much aware of
all of its usability problems:

* You have to disable optimizations in llvm, otherwise the verifier
  gets confused by too-smart compiler optimizations and rejects the code.

* It is very hard to debug the reason why the verifier is rejecting
  apparently valid code. That results in people playing strange games of
  "moving code around up and down".

* Lack of sufficient abstraction: bpf is not only exposing its own
  software bugs through its interface, but it will also bite the dust
  with CPU bugs due to lack of glue code to hide details behind the
  syscall interface curtain.  That will need a kernel upgrade after all to
  fix, so all the benefits of adding new programs are gone. We've even
  seen claims on this mailing list of performance being more important
  than security. Don't get me wrong, no software is safe from security
  issues, but if you don't abstract your resources in the right way, you
  have more chance to experience more problems.

Just to mention a few of them.

So, please, let each of us focus on our own work. Let me remind you of
your wise words - I think just one year ago in another of these episodes
of bpf vs. netfilter: "We're all working to achieve the same goals",
even if we're working on competing projects inside Linux.

Thanks!

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-20 Thread Michal Kubecek

On Mon, Feb 19, 2018 at 06:09:39PM +0100, Phil Sutter wrote:
> What puzzles me about your argumentation is that you seem to propose for
> the kernel to cover up flaws in userspace. Spinning this concept further
> would mean that if there would be an old bug in iproute2 we should think
> of adding a workaround to rtnetlink interface in kernel because
> containers will keep the old iproute2 binary? Or am I (hopefully) just
> missing your point?

Actually, that's what we already do. This is from rtnl_dump_ifinfo():

/* A hack to preserve kernel<->userspace interface.
 * The correct header is ifinfomsg. It is consistent with rtnl_getlink.
 * However, before Linux v3.9 the code here assumed rtgenmsg and that's
 * what iproute2 < v3.9.0 used.
 * We can detect the old iproute2. Even including the IFLA_EXT_MASK
 * attribute, its netlink message is shorter than struct ifinfomsg.
 */

(There are, in fact, even current tools using rtgenmsg but that's
another story.)
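
To make the length check described in that comment concrete, here is a minimal Python sketch (an illustration, not the kernel code). The struct layouts follow the UAPI definitions of struct rtgenmsg and struct ifinfomsg; the helper names nl_align and dump_header_len are made up for this example.

```python
import struct

# Layouts of the two candidate headers, per the Linux UAPI definitions.
RTGENMSG = struct.Struct("=B")        # struct rtgenmsg: rtgen_family
IFINFOMSG = struct.Struct("=BBHiII")  # struct ifinfomsg: family, pad, type, index, flags, change
NLATTR_HDR = struct.Struct("=HH")     # struct nlattr: nla_len, nla_type


def nl_align(length: int) -> int:
    """Round up to the 4-byte netlink alignment."""
    return (length + 3) & ~3


def dump_header_len(payload_len: int) -> int:
    """Hypothetical helper: infer the header type from the payload size."""
    if payload_len < IFINFOMSG.size:
        return RTGENMSG.size   # old iproute2 sent struct rtgenmsg
    return IFINFOMSG.size      # otherwise assume struct ifinfomsg


# Old-style request: aligned rtgenmsg plus one u32 attribute (IFLA_EXT_MASK).
old_payload = nl_align(RTGENMSG.size) + nl_align(NLATTR_HDR.size + 4)
# New-style request: ifinfomsg plus the same attribute.
new_payload = nl_align(IFINFOMSG.size) + nl_align(NLATTR_HDR.size + 4)

print(old_payload, dump_header_len(old_payload))
print(new_payload, dump_header_len(new_payload))
```

Under these layouts the old request's payload stays below the size of struct ifinfomsg even with the attribute attached, which is what makes the size comparison usable as a detector.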

Michal Kubecek

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Willem de Bruijn

> I see several possible areas of contention:
>
> 1) If you aim for a non-feature-complete support of iptables rules, it
>    will create confusion to the users.

Right, you need full feature parity to avoid ending up having to
maintain two implementations.

It seems uncontroversial that BPF can be very powerful if run at iptables
hooks. For performance, but also versatility. The android folks are converting
one out-of-tree module to BPF. There is probably a lot more such
business logic out there that is not suitable for inclusion in mainline as
an xt match/target, and that needs more access than xt_bpf can provide.

If a new first-class citizen BPF infra can do this and back the legacy
interface, too, that would save on maintenance. There is a steady
stream of fixes to iptables, e.g., from syzkaller vulnerability reports.
Just keeping the old implementation around as a dead letter is not a
safe deprecation strategy.

To bootstrap bpfilter, in the short term a reasonable set of iptables targets
and matches can perhaps be ported to BPF external functions with some
simple glue code.

> To me, this looks like some kind of legacy backwards compatibility
> mechanism that one would find in proprietary operating systems, but not
> in Linux.  iptables, libiptc etc. are all free software.  The source
> code can be edited, and you could just as well have a new version of
> iptables and/or libiptc which would pass the ruleset in userspace to
> your compiler, which would then insert the resulting eBPF program.
>
> Why add quite comprehensive kernel infrastructure?  What's the motivation
> here?

The ABI deprecation point has been discussed quite a bit. If it is
infeasible to just drop the old interface, then an upcall mechanism
does seem the most practical approach to dynamically generating this
code. FWIW, as BPF is being used in more places, other locations
besides iptables could make use of this.

> Could you please clarify why the 'filter' table INPUT chain was used if
> you're using XDP?  AFAICT they have completely different semantics.
>
> There is a well-conceived and generally understood notion of where
> exactly the filter/INPUT table processing happens.  And that's not as
> early as in the NIC, but it's much later in the processing of the
> packet.
>
> I believe _if_ one wants to use the approach of "hiding" eBPF behind
> iptables, then either
>
> a) the eBPF programs must be executed at the exact same points in the
>    stack as the existing hooks of the built-in chains of the
>    filter/nat/mangle/raw tables, or
>
> b) you must introduce new 'tables', like an 'xdp' table which then has
>    the notion of processing very early in processing, way before the
>    normal filter table INPUT processing happens.

Agreed. One of the larger issues in the Android qtaguid
conversion was the state surrounding the skb at the time of
processing. This example primarily depended on having skb->sk set.
Whether that is available at tc depends on early decap and even when
set the sk might prove different from the final one in the socket layer in
edge cases. Just one example how moving the call site can be very
fragile wrt state.

Another issue wrt moving around is availability of external functions
at different layers. XDP has access to far fewer than TC. For iptables,
I would imagine that you either want parity with TC or even a new
independent type. Parity would be useful also to expose some xt_match
functionality at the TC layer that is currently missing there.

> My main points are:
>
> 1) What is the goal of this?

My high bit feedback: for cases like qtaguid, it is very useful to be able
to execute BPF as drop-in at existing iptables locations, as is having
various match and target functionality available from BPF.

Maintaining the legacy ABI is basically dictated. If this can be achieved
while optimizing the runtime path and reducing maintenance that is
very appealing.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Jozsef Kadlecsik

Hi David,

On Mon, 19 Feb 2018, Florian Westphal wrote:

> David Miller  wrote:
> > 
> > Florian, first of all, the whole "change the iptables binary" idea is
> > a non-starter.  For the many reasons I have described in the various
> > postings I have made today.
> > 
> > It is entirely impractical.

You stressed several times that container images and virtualization
installations don't change - and that's an exaggeration. Those are updated
as well, and not only because security updates must be rolled out, but
because new versions of software are requested.

You mentioned that the hosting part can upgrade the kernel - it means that 
enabling NFTABLES is also a non-issue when the new eBPF functionality is 
switched on, if that was missing.

> You suggest:
> 
> iptables -> setsockopt -> umh (xtables -> ebpf) -> kernel
> 
> How is this different from
> 
> iptables -> setsockopt -> umh (Xtables -> nftables) -> kernel
> 
> ?
> EBPF can be placed within nftables either userspace or kernel,
> there is nothing that prevents this.

So why is the second scenario suggested by Florian not possible, or why
must it be avoided? It not only could keep the unmodified iptables in the
container (if that's a must for some reason), but it would make it possible
to replace it later anytime with iptables-compat/nftables.

Best regards,
Jozsef
-
E-mail  : kad...@blackhole.kfki.hu, kadlecsik.joz...@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
  H-1525 Budapest 114, POB. 49, Hungary

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal

David Miller  wrote:
> From: Phil Sutter 
> Date: Mon, 19 Feb 2018 18:14:11 +0100
> 
> > OK, so reading between the lines you're saying that nftables project
> > has failed to provide an adequate successor to iptables?
> 
> Whilst it is great that the atomic table update problem was solved, I
> think the emphasis on flexibility often at the expense of performance
> was a bad move.

That's not true, IMO.

One idea previously discussed was to add a 'freeze' option
to our nftables syntax.  Essentially what would happen is that further
updates to the table become impossible, with the exception of named sets
(which can be changed independently, similar to ebpf maps I suppose).

As further updates to the table are then no longer allowed this would
then make it possible to e.g. jit all rules into a single program.

The table could still be removed (and recreated) of course so it's
not impossible to make changes, but no longer at the rule level.
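
As a rough illustration of the 'freeze' semantics described above, here is a toy Python model. It is a sketch under assumptions, not nftables code: the Table class, FrozenTableError, and the rule strings (treated as opaque placeholders) are all hypothetical.

```python
class FrozenTableError(Exception):
    """Raised when a rule-level change hits a frozen table (hypothetical)."""


class Table:
    def __init__(self, name):
        self.name = name
        self.rules = []
        self.named_sets = {}
        self.frozen = False

    def add_rule(self, rule):
        # Rule-level updates are refused once the table is frozen.
        if self.frozen:
            raise FrozenTableError(f"table {self.name} is frozen")
        self.rules.append(rule)

    def freeze(self):
        # From here on the rule list cannot change, so it could be
        # compiled (for example JITed) into a single program once.
        self.frozen = True

    def add_set_element(self, set_name, element):
        # Named sets stay mutable even when the table is frozen.
        self.named_sets.setdefault(set_name, set()).add(element)


table = Table("filter")
table.add_rule("drop if source in set 'blocked'")  # opaque placeholder rule
table.freeze()
table.add_set_element("blocked", "192.0.2.1")      # still allowed

try:
    table.add_rule("accept everything else")
    rejected = False
except FrozenTableError:
    rejected = True

print(rejected, table.rules, table.named_sets)
```

Removing and recreating the table would correspond to building a new Table object, which matches the point that changes remain possible, just not at the rule level.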

> Netfilter's chronic performance differential is why a lot of mindshare
> was lost to userspace networking technologies.

I think this is an unfair statement and also not true.
If you refer to the linear-ruleset-evaluation of iptables, this is
what ipset was added for.

Yes, it's a band-aid.  But again, that problem comes from the UAPI
format/limitations of only having one source or destination address per
rule, a limitation not present in nftables.
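
The cost difference between one-address-per-rule evaluation and a set-backed rule can be sketched as follows. This is a toy Python model with made-up helper names (linear_match, set_match), not a description of how iptables, ipset, or nftables are implemented.

```python
def linear_match(rule_addrs, src):
    """One address per rule: walk the rules in order and count comparisons."""
    for checked, addr in enumerate(rule_addrs, start=1):
        if addr == src:
            return True, checked
    return False, len(rule_addrs)


def set_match(addr_set, src):
    """One rule backed by a set: a single hash lookup."""
    return src in addr_set


# 1000 distinct addresses, one per rule in the linear model.
addrs = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]

hit, comparisons = linear_match(addrs, addrs[-1])
print(hit, comparisons)                   # worst case walks every rule
print(set_match(set(addrs), addrs[-1]))   # same answer from one lookup
```

In the linear model the work grows with the number of rules, while the set-backed rule keeps the per-packet cost roughly constant regardless of how many addresses are listed.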

Another reason why iptables is a bit more costly than needed (although it
IS rather fast, given no spinlocks in the main eval loop) is the rule
counter updates which were built into the design all those years ago.

Again, a problem solved in nftables by making the counters optional.

If you want to speedup forward path with XDP -- fine.
But AFAIU it's still possible with XDP to have packets being sent to
full stack, right?

If so, it would be possible to even combine nftables with XDP, f.e.
by allowing an ebpf program running on host CPU to query netfilter
conntrack.

No Entry -> push to normal path
Entry -> check 'fastpath' flag (which would be in nf_conn struct).
Not set -> also normal path.
Otherwise continue XDP, stack bypass.

nftables would have a rule similar to this:
nft add rule inet forward ct state established ct label set fastpath
to switch such conntrack to xdp mode.

This decision can then be combined with nftables infra,
for example 'fastpath for tcp flows that saw more than 1mbit of data
in either direction' or the like.
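
The lookup logic outlined above (no entry, flag not set, flag set) can be written down as a small Python sketch. Everything here is hypothetical: ConntrackEntry stands in for nf_conn with the proposed 'fastpath' flag, and xdp_verdict for the eBPF program's decision; no such conntrack helper is assumed to exist.

```python
from dataclasses import dataclass

NORMAL_PATH = "normal path"
XDP_BYPASS = "xdp stack bypass"


@dataclass
class ConntrackEntry:
    # Stands in for the proposed 'fastpath' flag in struct nf_conn.
    fastpath: bool = False


def xdp_verdict(conntrack, flow):
    """Decide per packet whether to bypass the stack (hypothetical model)."""
    entry = conntrack.get(flow)
    if entry is None:
        return NORMAL_PATH       # no entry: push to normal path
    if not entry.fastpath:
        return NORMAL_PATH       # entry exists but flag not set
    return XDP_BYPASS            # flag set: continue in XDP


# Flows keyed by (src, dst, proto, sport, dport); the values are examples.
established = ("192.0.2.1", "198.51.100.7", "tcp", 40000, 443)
unflagged = ("192.0.2.2", "198.51.100.7", "tcp", 40001, 443)
unknown = ("192.0.2.3", "198.51.100.7", "tcp", 40002, 443)

conntrack = {
    established: ConntrackEntry(fastpath=True),   # flagged by the ruleset
    unflagged: ConntrackEntry(),
}

for flow in (established, unflagged, unknown):
    print(flow[0], "->", xdp_verdict(conntrack, flow))
```

In this model the ruleset's only job is to set the flag on selected entries, so policy such as "only flows above some byte count" stays in nftables while the per-packet check stays trivial.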

Yes, this needs ebpf support for conntrack and NAT transformations,
and it does beg the question how to handle the details, e.g. conntrack
timeouts.  Don't see any unsolvable issues with this though.

This also has similarities with the 'flow offload' proposal, i.e. we
could perhaps even reuse what we already have to provide flow
offload in software using eBPF/XDP as offload backend.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Phil Sutter

Hi David,

On Mon, Feb 19, 2018 at 01:41:29PM -0500, David Miller wrote:
> From: Phil Sutter 
> Date: Mon, 19 Feb 2018 19:05:51 +0100
> 
> > On Mon, Feb 19, 2018 at 12:22:26PM -0500, David Miller wrote:
> >> From: Phil Sutter 
> >> Date: Mon, 19 Feb 2018 18:14:11 +0100
> >> 
> >> > OK, so reading between the lines you're saying that nftables project
> >> > has failed to provide an adequate successor to iptables?
> >> 
> >> Whilst it is great that the atomic table update problem was solved, I
> >> think the emphasis on flexibility often at the expense of performance
> >> was a bad move.
> > 
> > I don't see a lack of performance in nftables when being compared to
> > iptables (as we have now). From my point of view, it's quite the
> > contrary: nftables did a great job in picking up iptables performance
> > afterthoughts (e.g. ipset) and leveraging that to the max(TM) (verdict
> > maps, concatenated set entries). Assuming the virtual machine design
> > principle isn't just marketing but sets the course for JIT ruleset
> > optimizations, there's some margin as well.
> > 
> > So from my perspective, one should say nftables increased flexibility
> > without sacrificing performance.
> 
> I did not say nftables adjusted performance one way or another.  It kept
> it on the same order of magnitude.  And this is a design decision.

Oh, seems I missed your point then. What subject did you have in mind
when you wrote "emphasis on flexibility often at the expense of
performance"? I thought you were talking about nftables.

> > Yes, even with my limited experience I noticed that there is quite some
> > demand for even faster packet processing in Linux, mostly for rather
> > custom scenarios like forwarding into containers/VMs. Though my point
> > was about general purpose firewalling abilities in Linux, say people
> > securing their desktop or maintaining networks with less demands on
> > performance.
> 
> I've always stated that low power, low end, systems are just as good a
> place for high performance filtering as high end ones.

Do you think these systems are likely to receive a NIC (or some sort of
co-processor) which allows for offloading eBPF to? Maybe I miss the
point again, but this is the only argument for bpfilter over nftables
(and that only if one ignores the option to implement an eBPF backend for
the nftables VM). OK, maybe this clarifies once I know what you had in mind
when you wrote that reply.

Cheers, Phil

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Harald Welte 
Date: Mon, 19 Feb 2018 19:37:30 +0100

> I was speaking of actual *users* as in individuals running their own
> systems, companies running their own servers/datacenter.  The fact that
> some ISP (or its supplier) decides that one of my IP packets is routed
> via a smartnic with XDP offloading somewhere is great, but still doesn't
> turn me into a "user" of that technology.  Not in my line of thinking,
> at least.

I am sorry that our opinions differ.

I must consider all users of Linux both direct and indirect, to
determine impact and where resources and efforts should be allocated.

>> And by and large, for system tracing and analysis eBPF is basically
>> a hard requirement for people doing anything serious these days.
> 
> That's great, but misses the point.  I was referring to usage in the
> context of the kernel network stack.  Sorry for not being explicit
> enough.

And that misses the point entirely.

Which is that eBPF is more than just networking: this technology is
not networking specific but a kernel-wide one that is being adopted
in every nook and cranny of the kernel.

> Sure, one data center / hosting / "cloud" provider can quickly roll out
> a change in their network.  But I'm referring to significant,
> (Linux-)industry-wide adoption.

Hehe, I guess whatever definition works for the position you are
trying to take.

:-)

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Arturo Borrero Gonzalez 
Date: Mon, 19 Feb 2018 19:06:12 +0100

> Yes, probably major datacenters (google? facebook?, amazon?) they
> don't even care about what Debian is doing, since they are crafting
> their own distro anyway.  But there are *a lot* of other people that
> do care about these migration plans.

"Lots" is a big word that gets thrown around quite carelessly.

What do you imagine is the order of magnitude of cloud and big
datacenter server system deployments vs. individual servers and
whatnot?

I have to take into consideration what will really have the largest
impact on the largest number of people, and I am pretty sure I know
where that lies.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Phil Sutter 
Date: Mon, 19 Feb 2018 19:05:51 +0100

> Hi David,
> 
> On Mon, Feb 19, 2018 at 12:22:26PM -0500, David Miller wrote:
>> From: Phil Sutter 
>> Date: Mon, 19 Feb 2018 18:14:11 +0100
>> 
>> > OK, so reading between the lines you're saying that nftables project
>> > has failed to provide an adequate successor to iptables?
>> 
>> Whilst it is great that the atomic table update problem was solved, I
>> think the emphasis on flexibility often at the expense of performance
>> was a bad move.
> 
> I don't see a lack of performance in nftables when being compared to
> iptables (as we have now). From my point of view, it's quite the
> contrary: nftables did a great job in picking up iptables performance
> afterthoughts (e.g. ipset) and leveraging that to the max(TM) (verdict
> maps, concatenated set entries). Assuming the virtual machine design
> principle isn't just marketing but sets the course for JIT ruleset
> optimizations, there's some margin as well.
> 
> So from my perspective, one should say nftables increased flexibility
> without sacrificing performance.

I did not say nftables adjusted performance one way or another.  It kept
it on the same order of magnitude.  And this is a design decision.

> Yes, even with my limited experience I noticed that there is quite some
> demand for even faster packet processing in Linux, mostly for rather
> custom scenarios like forwarding into containers/VMs. Though my point
> was about general purpose firewalling abilities in Linux, say people
> securing their desktop or maintaining networks with less demands on
> performance.

I've always stated that low power, low end, systems are just as good a
place for high performance filtering as high end ones.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Harald Welte

Hi David,

On Mon, Feb 19, 2018 at 12:29:08PM -0500, David Miller wrote:
> People with an Android phone in their pocket are using iptables, and
> the overhead and performance of those rules really does matter.  It
> determines how long your battery life is, etc.

I am not the android expert.  However, I just dumped the ruleset on my
Galaxy Tab S2 (Android 7.1.2 / LineageOS), and it was a whopping 91
rules across all tables.  The longest chain iteration I could spot
was 24 rules.  That's not the kind of ruleset where I would expect
performance worries.

And if there was, nftables is around for quite some time and would be
much faster.

Sure, that was just one tablet, but I wonder how many Android packet
filter performance issues there are.  Would be interesting to hear about
those (and on whether they benchmarked against nftables).

> > I can just as well ask how many millions of users / devices are
> > already using eBPF or XDP?
> 
> Every time someone connects to a major provider, they are using it.

I was speaking of actual *users* as in individuals running their own
systems, companies running their own servers/datacenter.  The fact that
some ISP (or its supplier) decides that one of my IP packets is routed
via a smartnic with XDP offloading somewhere is great, but still doesn't
turn me into a "user" of that technology.  Not in my line of thinking,
at least.

> And by and large, for system tracing and analysis eBPF is basically
> a hard requirement for people doing anything serious these days.

That's great, but misses the point.  I was referring to usage in the
context of the kernel network stack.  Sorry for not being explicit
enough.

Also, the entire point was about "new technologies need time to be
adopted widely".  Doesn't matter which new kernel feature that is.

Sure, one data center / hosting / "cloud" provider can quickly roll out
a change in their network.  But I'm referring to significant,
(Linux-)industry-wide adoption.  That would first include major
distributions to include/enable/support the feature, and then people
actually building their systems/products/software on top of those.

> Please see the wonderful work by Brendan Gregg and others which has
> basically made the GPL'ing of DTrace by Oracle entirely irrelevant and
> our Linux's tracing infrastructure has become much more powerful and
> capable thanks to eBPF.

Agreed.

-- 
- Harald Welte           http://laforge.gnumonks.org/

"Privacy in residential applications is a desirable marketing option."
  (ETSI EN 300 175-7 Ch. A6)

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Arturo Borrero Gonzalez

On 19 February 2018 at 16:36, David Miller  wrote:
>
> In my opinion, any resistance to integration with eBPF and XDP will
> lead to even less adoption of netfilter as a technology.
>
> Therefore my plan is to move everything to be integrated around these
> important core technologies.  For the purposes of integration, code
> coverage, performance, and the ability to juxtapose different bits of
> eBPF code into larger optimized code streams that can also be
> offloaded into hardware.

Thanks for sharing your plans. I'll share mine.

Debian already recommends using nftables rather than iptables.
Probably in the next release cycle we (Debian) will give even more
prominence to nftables by linking iptables to iptables-compat, as an
opt-in for users, so we don't break systems.
By the next-next release cycle (4+ years or so?) we will probably have
enough confidence with compat/translation tools that Debian could
fully wipe the old iptables binary to use just the nftables framework.
Same for ip6tables, arptables, ebtables.

Does this sound reasonable to you?

Yes, probably major datacenters (google? facebook?, amazon?) they
don't even care about what Debian is doing, since they are crafting
their own distro anyway.
But there are *a lot* of other people that do care about these migration plans.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Phil Sutter

Hi David,

On Mon, Feb 19, 2018 at 12:22:26PM -0500, David Miller wrote:
> From: Phil Sutter 
> Date: Mon, 19 Feb 2018 18:14:11 +0100
> 
> > OK, so reading between the lines you're saying that nftables project
> > has failed to provide an adequate successor to iptables?
> 
> Whilst it is great that the atomic table update problem was solved, I
> think the emphasis on flexibility often at the expense of performance
> was a bad move.

I don't see a lack of performance in nftables when being compared to
iptables (as we have now). From my point of view, it's quite the
contrary: nftables did a great job in picking up iptables performance
afterthoughts (e.g. ipset) and leveraging that to the max(TM) (verdict
maps, concatenated set entries). Assuming the virtual machine design
principle isn't just marketing but sets the course for JIT ruleset
optimizations, there's some margin as well.

So from my perspective, one should say nftables increased flexibility
without sacrificing performance.

> Netfilter's chronic performance differential is why a lot of mindshare
> was lost to userspace networking technologies.
> 
> Thankfully, we are gaining back a lot of that userbase with XDP and
> eBPF, thanks to the hard work of many individuals.
> 
> To think that people are going to be willing to take the performance
> hit (whatever its size) to go back to the "more flexible" nftables
> is really not a realistic expectation.
> 
> And we have amassed enough interest and momentum that offloading eBPF
> in hardware on current and future hardware is happening.
> 
> So I am going to direct us in directions that allow those realities to
> be taken advantage of, rather than pretending that this transition
> hasn't occurred already.

Hey, you secretly changed the topic! ;)

Yes, even with my limited experience I noticed that there is quite some
demand for even faster packet processing in Linux, mostly for rather
custom scenarios like forwarding into containers/VMs. Though my point
was about general purpose firewalling abilities in Linux, say people
securing their desktop or maintaining networks with less demands on
performance. I guess it will be a while until consumer hardware comes
with smart NICs (or they become affordable), so for those people
nftables is definitely a step forward.

Cheers, Phil

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Arturo Borrero Gonzalez

On 19 February 2018 at 16:27, David Miller  wrote:
> From: Florian Westphal 
> Date: Mon, 19 Feb 2018 16:15:55 +0100
>
>> Would you be willing to merge nftables into kernel tools directory
>> then?
>
> Did you miss the part where I explained that people explicitly disable
> NFTABLES in their kernel configs in most if not all large datacenters?

hey, you already shared several statements regarding nftables which
are not true.

Lots and lots of people are using distribution kernels, which have the
NF_TABLES config enabled (all major distros have it).
I believe people who build their own kernels are very few if you
compare with the number of people who don't (but yeah, they usually
have more money).
This may sound like a joke, but there are *a lot* of people running
production servers with bluetooth drivers enabled in the kconfig.

So, I can confirm that:
Lots of people and institutions are using nftables already.
Lots of people and institutions are considering the transition to nftables
from iptables.
Lots of people are running simple commodity hardware and know nothing
about smartnics or any kind of offloading.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Arturo Borrero Gonzalez

On 19 February 2018 at 16:36, David Miller  wrote:
>
> I think netfilter is at a real crossroads right now.
>

I don't think so. The Netfilter Project and the Netfilter Community
already "agreed" on nftables and we are working on it.
But this isn't a secret, right? We have been open-discussing and
open-working on this for *years* now.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Harald Welte

Hi David,

On Mon, Feb 19, 2018 at 10:31:39AM -0500, David Miller wrote:
> > Why is it practical to replace your kernel but not practical to replace
> > a small userspace tool running on top of it?
> 
> The container is just userspace components.  Those are really baked in
> and are never changing.

never until you have to apply a bug fix for any of the many components you bake
into it.  I am doing this on an (at least) weekly basis for my Docker containers.
That's no different from a classic Linux distribution where you update your
apt/rpm packages all the time.

A container that is static and cannot be continuously updated with new versions
for security (and other) fixes is broken by design.  If some people are doing
this, they IMHO have no sense of IT security, and such usage patterns are not
what kernel development should cite as primary use case (again IMHO).

> This is how cloud hosting environments work.

Yes, *one* particular use case.  By far not every use case of Linux, or
Linux packet filtering.

-- 
- Harald Welte           http://laforge.gnumonks.org/

"Privacy in residential applications is a desirable marketing option."
  (ETSI EN 300 175-7 Ch. A6)

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Harald Welte 
Date: Mon, 19 Feb 2018 18:20:40 +0100

> It's like with any migration.  People were using ipchains for a long
> time even after iptables existed.  Many people simply don't care
> about packet filter performance.  It's only a small fraction of their
> entire CPU workload, so probably not worth optimizing.  For dedicated
> firewall devices, that's of course a different story.

"I have power in my house, what's the big deal about this power
outage I hear about?"

People with an Android phone in their pocket are using iptables, and
the overhead and performance of those rules really does matter.  It
determines how long your battery life is, etc.

> I can just as well ask how many millions of users / devices are
> already using eBPF or XDP?

Every time someone connects to a major provider, they are using it.

And by and large, for system tracing and analysis eBPF is basically
a hard requirement for people doing anything serious these days.

Please see the wonderful work by Brendan Gregg and others which has
basically made the GPL'ing of DTrace by Oracle entirely irrelevant and
our Linux's tracing infrastructure has become much more powerful and
capable thanks to eBPF.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Phil Sutter 
Date: Mon, 19 Feb 2018 18:14:11 +0100

> OK, so reading between the lines you're saying that nftables project
> has failed to provide an adequate successor to iptables?

Whilst it is great that the atomic table update problem was solved, I
think the emphasis on flexibility often at the expense of performance
was a bad move.

Netfilter's chronic performance differential is why a lot of mindshare
was lost to userspace networking technologies.

Thankfully, we are gaining back a lot of that userbase with XDP and
eBPF, thanks to the hard work of many individuals.

To think that people are going to be willing to take the performance
hit (whatever its size) to go back to the "more flexible" nftables
is really not a realistic expectation.

And we have amassed enough interest and momentum that offloading eBPF
in hardware on current and future hardware is happening.

So I am going to direct us in directions that allow those realities to
be taken advantage of, rather than pretending that this transition
hasn't occurred already.

Thank you.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Harald Welte

Hi David,

On Mon, Feb 19, 2018 at 10:36:51AM -0500, David Miller wrote:

> nftables has been purported as "better" for years, yet large
> institutions did not migrate to it.  In fact, they explicitly
> disabled NFTABLES in their kernel config.

It's like with any migration.  People were using ipchains for a long
time even after iptables existed.  Many people simply don't care
about packet filter performance.  It's only a small fraction of their
entire CPU workload, so probably not worth optimizing.  For dedicated
firewall devices, that's of course a different story.

How long did it take for the getrandom() system call to be actually used
by applications [even glibc!]?  Or many other things that get introduced
in the kernel?

I can just as well ask how many millions of users / devices are already
using eBPF or XDP? How many major Linux distributions are enabling
and/or supporting this yet?  I'm not criticizing, I'm just attempting
to illustrate that technologies always take time to establish
themselves - and of course those people with the biggest benefit (and
knowing about it) will be the early adopters, while many others have no
motivation to migrate.

> In my opinion, any resistance to integration with eBPF and XDP will
> lead to even less adoption of netfilter as a technology.

1) I may not have made my point clear, sorry.  I have not argued against
   any integration with eBPF, I have just made some specific arguments
   against specific aspects of the current RFC.

2) You have indicated repeatedly that there are millions and millions of
   netfilter/iptables users out there.  So I fail to see the "even less
   adoption" part.  "Even less" than those millions and millions? SCNR.

Regards,
Harald
-- 
- Harald Welte           http://laforge.gnumonks.org/

"Privacy in residential applications is a desirable marketing option."
  (ETSI EN 300 175-7 Ch. A6)

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Phil Sutter 
Date: Mon, 19 Feb 2018 18:09:39 +0100

> What puzzles me about your argumentation is that you seem to propose for
> the kernel to cover up flaws in userspace. Spinning this concept further
> would mean that if there would be an old bug in iproute2 we should think
> of adding a workaround to rtnetlink interface in kernel because
> containers will keep the old iproute2 binary? Or am I (hopefully) just
> missing your point?

I'll answer this with a question.  I tried to remove UFO entirely from
the kernel, did you see how that went?

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Phil Sutter

Hi David,

On Mon, Feb 19, 2018 at 10:44:59AM -0500, David Miller wrote:
> From: Harald Welte 
> Date: Mon, 19 Feb 2018 16:38:08 +0100
> 
> > On Mon, Feb 19, 2018 at 10:27:27AM -0500, David Miller wrote:
> >> > Would you be willing to merge nftables into kernel tools directory
> >> > then?
> >> 
> >> Did you miss the part where I explained that people explicitly disable
> >> NFTABLES in their kernel configs in most if not all large datacenters?
> > 
> > If people choose to disable a certain feature, then that is their own
> > decision to do so.  We should respect that decision.  Clearly they seem
> > to have no interest in a better or more featureful packet filter, then.
> > 
> > I mean, it's not like somebody proposes to implement NTFS inside the FAT
> > filesystem kernel module because distributors (or data centers) tend to
> > disable the NTFS module?!
> > 
> > How is kernel development these days constrained by what some users may
> > or may not put in their Kconfig?  If they want a given feature, they
> > must enable it.
> 
> This discussion was about why iptables UABI still matters.
> 
> And I'm trying to explain to you one of several reasons why it does.
> 
> Also, instead of saying "They decided to not use NFTABLES, oh well
> that is their problem" it might be more beneficial, especially in the
> long term for netfilter, to think about "why".

OK, so reading between the lines you're saying that nftables project has
failed to provide an adequate successor to iptables?

Cheers, Phil

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Phil Sutter

Hi David,

On Mon, Feb 19, 2018 at 10:31:39AM -0500, David Miller wrote:
> From: Harald Welte 
> Date: Mon, 19 Feb 2018 16:27:46 +0100
> 
> > On Mon, Feb 19, 2018 at 10:13:35AM -0500, David Miller wrote:
> > 
> >> Florian, first of all, the whole "change the iptables binary" idea is
> >> a non-starter.  For the many reasons I have described in the various
> >> postings I have made today.
> >> 
> >> It is entirely impractical.
> > 
> > Why is it practical to replace your kernel but not practical to replace
> > a small userspace tool running on top of it?
> 
> The container is just userspace components.  Those are really baked in
> and are never changing.

Which is a problem per se. Cheap hardware routers are a good example of
why business models which tend to get customers stuck with old software
have such dramatic effects at least in matters of security.

> The hosting element, on the other hand, can upgrade the kernel in that
> scenario no problem.
> 
> This is how cloud hosting environments work.

What puzzles me about your argumentation is that you seem to propose for
the kernel to cover up flaws in userspace. Spinning this concept further
would mean that if there would be an old bug in iproute2 we should think
of adding a workaround to rtnetlink interface in kernel because
containers will keep the old iproute2 binary? Or am I (hopefully) just
missing your point?

Cheers, Phil

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Harald Welte 
Date: Mon, 19 Feb 2018 16:38:08 +0100

> On Mon, Feb 19, 2018 at 10:27:27AM -0500, David Miller wrote:
>> > Would you be willing to merge nftables into kernel tools directory
>> > then?
>> 
>> Did you miss the part where I explained that people explicitly disable
>> NFTABLES in their kernel configs in most if not all large datacenters?
> 
> If people choose to disable a certain feature, then that is their own
> decision to do so.  We should respect that decision.  Clearly they seem
> to have no interest in a better or more featureful packet filter, then.
> 
> I mean, it's not like somebody proposes to implement NTFS inside the FAT
> filesystem kernel module because distributors (or data centers) tend to
> disable the NTFS module?!
> 
> How is kernel development these days constrained by what some users may
> or may not put in their Kconfig?  If they want a given feature, they
> must enable it.

This discussion was about why iptables UABI still matters.

And I'm trying to explain to you one of several reasons why it does.

Also, instead of saying "They decided to not use NFTABLES, oh well
that is their problem" it might be more beneficial, especially in the
long term for netfilter, to think about "why".

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Jan Engelhardt 
Date: Mon, 19 Feb 2018 16:37:57 +0100 (CET)

> On Monday 2018-02-19 16:32, David Miller wrote:
> 
>>From: Harald Welte 
>>Date: Mon, 19 Feb 2018 16:23:21 +0100
>>
>>> Also, as long as legacy ip_tables/x_tables is still in the kernel, you
>>> can still run your old userspace against that old implementation in the
>>> kernel.
>>
>>But without offloading, and the various other benefits which I have
>>tried to clearly explain to both you and Florian.
> 
> Which is actually the business model to get people *off* the old ABI in 
> reasonable time.

Hosting companies can't change what customers run in their containers.

But if they are told that a kernel upgrade will get them offloading
and increase their performance tremendously, then that gives them real
value.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Harald Welte

Dear David,

On Mon, Feb 19, 2018 at 10:27:27AM -0500, David Miller wrote:
> > Would you be willing to merge nftables into kernel tools directory
> > then?
> 
> Did you miss the part where I explained that people explicitly disable
> NFTABLES in their kernel configs in most if not all large datacenters?

If people choose to disable a certain feature, then that is their own
decision to do so.  We should respect that decision.  Clearly they seem
to have no interest in a better or more featureful packet filter, then.

I mean, it's not like somebody proposes to implement NTFS inside the FAT
filesystem kernel module because distributors (or data centers) tend to
disable the NTFS module?!

How is kernel development these days constrained by what some users may
or may not put in their Kconfig?  If they want a given feature, they
must enable it.

-- 
- Harald Welte           http://laforge.gnumonks.org/

"Privacy in residential applications is a desirable marketing option."
  (ETSI EN 300 175-7 Ch. A6)

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Jan Engelhardt

On Monday 2018-02-19 16:32, David Miller wrote:

>From: Harald Welte 
>Date: Mon, 19 Feb 2018 16:23:21 +0100
>
>> Also, as long as legacy ip_tables/x_tables is still in the kernel, you
>> can still run your old userspace against that old implementation in the
>> kernel.
>
>But without offloading, and the various other benefits which I have
>tried to clearly explain to both you and Florian.

Which is actually the business model to get people *off* the old ABI in 
reasonable time. Otherwise, we would have to ask ourselves why we have 
not yet enhanced /dev/raw with mmap and whatnot.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Harald Welte 
Date: Mon, 19 Feb 2018 16:23:21 +0100

>> Like it or not iptables ABI based filtering is going to be in the data
>> path for many years if not a decade or more to come.  
> 
> I beg to differ.  For some people, yes.  but then, as Florian points
> out, they can just as well use the existing x_tables kernel code.  If
> they want something better, they can either replace their iptables
> program with xtables-compat from nftables, or whatever else might
> exist for eBPF support.

nftables has been purported as "better" for years, yet large
institutions did not migrate to it.  In fact, they explicitly
disabled NFTABLES in their kernel config.

You may want to ponder for a little while why that might be.

I think netfilter is at a real crossroads right now.

In my opinion, any resistance to integration with eBPF and XDP will
lead to even less adoption of netfilter as a technology.

Therefore my plan is to move everything to be integrated around these
important core technologies.  For the purposes of integration, code
coverage, performance, and the ability to juxtapose different bits of
eBPF code into larger optimized code streams that can also be
offloaded into hardware.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Harald Welte 
Date: Mon, 19 Feb 2018 16:23:21 +0100

> Also, as long as legacy ip_tables/x_tables is still in the kernel, you
> can still run your old userspace against that old implementation in the
> kernel.

But without offloading, and the various other benefits which I have
tried to clearly explain to both you and Florian.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Harald Welte 
Date: Mon, 19 Feb 2018 16:27:46 +0100

> On Mon, Feb 19, 2018 at 10:13:35AM -0500, David Miller wrote:
> 
>> Florian, first of all, the whole "change the iptables binary" idea is
>> a non-starter.  For the many reasons I have described in the various
>> postings I have made today.
>> 
>> It is entirely impractical.
> 
> Why is it practical to replace your kernel but not practical to replace
> a small userspace tool running on top of it?

The container is just userspace components.  Those are really baked in
and are never changing.

The hosting element, on the other hand, can upgrade the kernel in that
scenario no problem.

This is how cloud hosting environments work.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Harald Welte

Hi David,

On Mon, Feb 19, 2018 at 10:13:35AM -0500, David Miller wrote:

> Florian, first of all, the whole "change the iptables binary" idea is
> a non-starter.  For the many reasons I have described in the various
> postings I have made today.
> 
> It is entirely impractical.

Why is it practical to replace your kernel but not practical to replace
a small userspace tool running on top of it?

-- 
- Harald Welte           http://laforge.gnumonks.org/

"Privacy in residential applications is a desirable marketing option."
  (ETSI EN 300 175-7 Ch. A6)

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Harald Welte

Hi David,

On Mon, Feb 19, 2018 at 09:44:51AM -0500, David Miller wrote:
> I see talk about "just replacing the iptables binary".
> 
> A long time ago in a galaxy far far away, that would be a reasonable
> scheme.  But that kind of approach won't work in the realities of
> today.
>
> You aren't going to be able to replace the iptables binary in the tens
> of thousands of container images out there, nor the virtualization
> installations either.

I appear to have been under the impression that the entire movement to
DevOps and automatic provisioning of containers/nodes/pods/VMs with
puppet, ansible, Dockerfiles & Co is to be *more* agile in deployments,
rather than less.

If you cannot even rebuild your thousands of container images with
updated binaries, then what is this all worth?  You need to be able to
update the iptables (or any other) binary in case there's an important
(security or otherwise) bug that needs fixing.  I don't see this any
different.

Also, as long as legacy ip_tables/x_tables is still in the kernel, you
can still run your old userspace against that old implementation in the
kernel.  Nobody forces you to use anything else [for another decade or
so].  Just if you want to take advantage of new more
modular/performant/... things like nftables or an eBPF backend, then you
would have to go that extra mile.

I don't think the kernel (network) developers should burden themselves
with too many things.  There's sufficient on their plate as-is.  So 
* if there's some new system (nftables, bpfilter, ...)
* and some documented migration paths for the vast majority of the use
  cases (replacing iptables binaries with a compat wrapper)
* and the old system continues to work as-is (x_tables kernel code stays for
  several more years)

Then people who care about the new features or performance will migrate
to the new system.  And those who don't care stay with the old system -
which is not a problem as they clearly wouldn't need the new system
anyway.

> Like it or not iptables ABI based filtering is going to be in the data
> path for many years if not a decade or more to come.  

I beg to differ.  For some people, yes.  but then, as Florian points
out, they can just as well use the existing x_tables kernel code.  If
they want something better, they can either replace their iptables
program with xtables-compat from nftables, or whatever else might
exist for eBPF support.

> iptables is a victim of its own success, like it or not :-) Yes, the
> ABI is terrible, but obviously it was useful enough for lots of
> people.

and it continues to do so.  I just don't think it is a great idea
to kludge any new packet filter against such an arcane uapi.

> Therefore it behooves us to accept this reality and align the data
> path generated to match what the rest of the kernel is moving towards
> and that is eBPF and XDP.

This argument is unrelated to the question of the uapi. I'm not arguing
against an eBPF backend/implementation for packet filtering.  It's more
a question of _how_.

> Furthermore, from a long-term maintenance perspective, it means that
> every data path used by the kernel for iptables will be fully verified
> by the eBPF verifier.  This means that the iptables data path will be
> guaranteed to never get into a loop, access out of bounds data, etc.
> 
> That to me is real power, and something we should pursue.

Once again, both not related to the question of the uapi.

> I know you can't see how offloading is possible, but I hope
> after some further discussion you can see how that might work.

I'm looking forward to that point.

Regards,
Harald

-- 
- Harald Welte           http://laforge.gnumonks.org/

"Privacy in residential applications is a desirable marketing option."
  (ETSI EN 300 175-7 Ch. A6)

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Florian Westphal 
Date: Mon, 19 Feb 2018 16:20:23 +0100

> See my other mail, where I explained, in great detail, the problems
> of the xtables UAPI.

As the person who wrote the bpfilter UAPI parser for this, you don't
need to explain this to me.

But it's not going anywhere, and is used by millions upon millions of
users.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Florian Westphal 
Date: Mon, 19 Feb 2018 16:15:55 +0100

> Would you be willing to merge nftables into kernel tools directory
> then?

Did you miss the part where I explained that people explicitly disable
NFTABLES in their kernel configs in most if not all large datacenters?

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal

David Miller  wrote:
> From: Florian Westphal 
> Date: Mon, 19 Feb 2018 15:53:14 +0100
> 
> > Sure, but looking at all the things that were added to iptables
> > to alleviate some of the issues (ipset for instance) show that we need a
> > meaningful re-design of how things work conceptually.
> 
> As you said, iptables is in maintenance mode.
> 
> But there are millions upon millions of users, like it or not, and
> they aren't going away for decades.  And this is the iptables binary
> ABI I'm talking about, not the iptables user command line interface.

I know.

> my house?"  Please see further than the view inside your home. 
> 
> By and large, we are stuck with iptables's data path for an extremely
> long time.

So?

> Major data centers don't even enable NFTABLES in their kernels, and
> there is nothing you can do about that in the short to medium term.

So?

> Therefore, for all of the beneficial reasons I have discussed we
> should make that datapath as aligned and integrated with our core
> important technologies as possible, so that they can benefit from any
> and all improvements in that area rather than just collecting dust.

See my other mail, where I explained, in great detail, the problems
of the xtables UAPI.

If you go through with this, and, eventually somehow get feature parity,
all of the problems remain in full effect.
You will also need to replicate the translation efforts that already
went into nftables.  The translator wasn't yet a high priority as we
lacked some features but this can be changed now that nft is catching
up.

Userspace program expectation is for iptables to be like fib for
instance, i.e. you can add and remove without stomping on each other's
feet.  You are setting this in stone.

You're also adding a way to make it so that I can delete entries from
the fib (bpfilter) but iproute2 will still show all entries (iptables
legacy).

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal

David Miller  wrote:
> From: Florian Westphal 
> Date: Mon, 19 Feb 2018 15:59:35 +0100
> 
> > David Miller  wrote:
> >> It also means that the scope of developers who can contribute and work
> >> on the translator is much larger.
> > 
> > How so?  Translator is in userspace in nftables case too?
> 
> Florian, first of all, the whole "change the iptables binary" idea is
> a non-starter.  For the many reasons I have described in the various
> postings I have made today.
> 
> It is entirely impractical.

???
You suggest:

iptables -> setsockopt -> umh (xtables -> ebpf) -> kernel

How is this different from

iptables -> setsockopt -> umh (xtables -> nftables) -> kernel

?
eBPF can be placed within nftables, either in userspace or in the kernel;
there is nothing that prevents this.
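
The two pipelines above differ only in what the usermode helper emits.
A toy sketch of that shape follows.  Every name in it, the
"proto:port:verdict" input format and both output formats are
hypothetical placeholders for illustration, not bpfilter or nftables
code:

```python
def parse_legacy_rule(blob):
    """Parse a simplified 'proto:port:verdict' string.

    This stands in for the binary xtables blob passed via setsockopt;
    the real format is far more complex.
    """
    proto, port, verdict = blob.split(":")
    return {"proto": proto, "dport": int(port), "verdict": verdict}


def to_ebpf(rule):
    # Placeholder pseudo-instructions, NOT real eBPF bytecode.
    return [
        f"if proto != {rule['proto']} goto next",
        f"if dport != {rule['dport']} goto next",
        f"return {rule['verdict']}",
    ]


def to_nftables(rule):
    # Placeholder text loosely modelled on nft rule syntax.
    return f"{rule['proto']} dport {rule['dport']} {rule['verdict']}"


def umh(blob, backend):
    """Usermode-helper stage: parse the legacy rule, hand it to a backend."""
    return backend(parse_legacy_rule(blob))


print(umh("tcp:22:accept", to_ebpf))
print(umh("tcp:22:accept", to_nftables))
```

The parsing stage is shared; only the backend passed to umh() changes.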

> Anything designed in that nature must be distributed completely in the
> kernel tree, so that the iptables kernel ABI is provided without any
> external dependencies.

Would you be willing to merge nftables into kernel tools directory then?

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Florian Westphal 
Date: Mon, 19 Feb 2018 15:59:35 +0100

> David Miller  wrote:
>> It also means that the scope of developers who can contribute and work
>> on the translator is much larger.
> 
> How so?  Translator is in userspace in nftables case too?

Florian, first of all, the whole "change the iptables binary" idea is
a non-starter.  For the many reasons I have described in the various
postings I have made today.

It is entirely impractical.

So we are strictly talking about the code we are writing to translate
iptables ABI (in the kernel) into an eBPF based datapath.

Anything designed in that nature must be distributed completely in the
kernel tree, so that the iptables kernel ABI is provided without any
external dependencies.

We could have done the translator in the kernel, but instead we are
doing it with a userland component.

And that's what we are talking about.

Thank you.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Florian Westphal 
Date: Mon, 19 Feb 2018 15:53:14 +0100

> Sure, but looking at all the things that were added to iptables
> to alleviate some of the issues (ipset for instance) show that we need a
> meaningful re-design of how things work conceptually.

As you said, iptables is in maintenance mode.

But there are millions upon millions of users, like it or not, and
they aren't going away for decades.  And this is the iptables binary
ABI I'm talking about, not the iptables user command line interface.

These discussions about nftables migrations sound like a person near a
power outage who exclaims: "What's the big deal, the lights are on in
my house?"  Please see further than the view inside your home. 

By and large, we are stuck with iptables's data path for an extremely
long time.

Major data centers don't even enable NFTABLES in their kernels, and
there is nothing you can do about that in the short to medium term.

Therefore, for all of the beneficial reasons I have discussed we
should make that datapath as aligned and integrated with our core
important technologies as possible, so that they can benefit from any
and all improvements in that area rather than just collecting dust.

Thank you.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal

David Miller  wrote:
> From: Daniel Borkmann 
> Date: Mon, 19 Feb 2018 13:03:17 +0100
> 
> > Thought was that it would be more suitable to push all the complexity of
> > such translation into user space which brings couple of additional advantages
> > as well: the translation can become very complex and thus it would contain
> > all of it behind syscall boundary where natural path of loading programs
> > would go via verifier. Given the tool would reside in user space, it would
> > also allow to ease development and testing can happen w/o recompiling the
> > kernel. It would allow for all the clang sanitizers to run there and for
> > having a comprehensive test suite to verify and dry test translations against
> > traffic test patterns (e.g. bpf infra would provide possibilities on this
> > w/o complex setup). Given normal user mode helpers make this rather painful
> > since they need to be shipped as extra package by the various distros, the
> > idea was that the module loader back end could treat umh similarly as kernel
> > modules and hook them in through request_module() approach while still
> > operating out of user space. In any case, I could image this approach might
> > be interesting and useful in general also for other subsystems requiring
> > umh in one way or another.
> 
> Yes, this is a very powerful new facility.
> 
> It also means that the scope of developers who can contribute and work
> on the translator is much larger.

How so?  Translator is in userspace in nftables case too?

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Daniel Borkmann 
Date: Mon, 19 Feb 2018 13:03:17 +0100

> Thought was that it would be more suitable to push all the complexity of
> such translation into user space which brings couple of additional advantages
> as well: the translation can become very complex and thus it would contain
> all of it behind syscall boundary where natural path of loading programs
> would go via verifier. Given the tool would reside in user space, it would
> also allow to ease development and testing can happen w/o recompiling the
> kernel. It would allow for all the clang sanitizers to run there and for
> having a comprehensive test suite to verify and dry test translations against
> traffic test patterns (e.g. bpf infra would provide possibilities on this
> w/o complex setup). Given normal user mode helpers make this rather painful
> since they need to be shipped as extra package by the various distros, the
> idea was that the module loader back end could treat umh similarly as kernel
> modules and hook them in through request_module() approach while still
> operating out of user space. In any case, I could imagine this approach might
> be interesting and useful in general also for other subsystems requiring
> umh in one way or another.

Yes, this is a very powerful new facility.

It also means that the scope of developers who can contribute and work
on the translator is much larger.

When we showed this infrastructure to Linus he thought it was a very
sane idea.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal

David Miller  wrote:
> > How many of those wide-spread applications are you aware of?  The two
> > projects you have pointed out (docker and kubernetes) don't. As the
> > assumption that many such tools would need to be supported drives a lot
> > of the design decisions, I would argue one needs a solid empirical basis.
> 
> I see talk about "just replacing the iptables binary".
> 
> A long time ago in a galaxy far far away, that would be a reasonable
> scheme.  But that kind of approach won't work in the realities of
> today.
> 
> You aren't going to be able to replace the iptables binary in the tens
> of thousands of container images out there, nor the virtualization
> installations either.

Why would you have to?
iptables kernel parts are still maintained, it's not dead code that
stands in the way.

We can leave it alone, in maintenance mode, just fine.

> Like it or not iptables ABI based filtering is going to be in the data
> path for many years if not a decade or more to come.  iptables is a
> victim of its own success, like it or not :-) Yes, the ABI is
> terrible, but obviously it was useful enough for lots of people.

Sure, but looking at all the things that were added to iptables
to alleviate some of the issues (ipset for instance) show that we need a
meaningful re-design of how things work conceptually.

The umh helper translation that has been proposed could be applied to
transparently xlate iptables to nftables (or e.g. iptables compat32 to
iptables64), i.e. legacy binary talks to kernel, kernel invokes umh, umh
generates nftables netlink messages).

But I don't even see a need to do this, I don't think it's an issue
to leave it in the tree even for another decade or more if need be.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread David Miller

From: Harald Welte 
Date: Mon, 19 Feb 2018 13:52:18 +0100

>> Right, having a custom iptables, libiptc or LD_PRELOAD approach would work
>> as well of course, but it still wouldn't address applications that have
>> their own custom libs programmed against iptables uapi directly or those
>> that reused a builtin or modified libiptc directly in their application.
> 
> How many of those wide-spread applications are you aware of?  The two
> projects you have pointed out (docker and kubernetes) don't. As the
> assumption that many such tools would need to be supported drives a lot
> of the design decisions, I would argue one needs a solid empirical basis.

I see talk about "just replacing the iptables binary".

A long time ago in a galaxy far far away, that would be a reasonable
scheme.  But that kind of approach won't work in the realities of
today.

You aren't going to be able to replace the iptables binary in the tens
of thousands of container images out there, nor the virtualization
installations either.

Like it or not iptables ABI based filtering is going to be in the data
path for many years if not a decade or more to come.  iptables is a
victim of its own success, like it or not :-) Yes, the ABI is
terrible, but obviously it was useful enough for lots of people.

Therefore it behooves us to accept this reality and align the data
path generated to match what the rest of the kernel is moving towards
and that is eBPF and XDP.

Furthermore, from a long-term maintenance perspective, it means that
every data path used by the kernel for iptables will be fully verified
by the eBPF verifier.  This means that the iptables data path will be
guaranteed to never get into a loop, access out of bounds data, etc.
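
To make those two guarantees concrete, here is a toy model of the kind
of checks involved.  It is a sketch under loud assumptions: the three
opcodes, the Insn type and toy_verify() are invented for illustration,
and the packet length is assumed to be known up front.  The in-kernel
verifier is far more elaborate and works on real eBPF bytecode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    op: str       # "load", "jmp" or "exit" (toy opcodes, not real eBPF)
    arg: int = 0  # packet byte offset for "load", relative offset for "jmp"


def toy_verify(prog, pkt_len):
    """Return None if prog is accepted, else a short rejection reason."""
    if not prog or prog[-1].op != "exit":
        return "program must end with exit"
    for pc, insn in enumerate(prog):
        if insn.op == "jmp":
            target = pc + 1 + insn.arg
            # Only forward jumps are allowed, so execution always terminates.
            if insn.arg < 0:
                return f"insn {pc}: back-edge, possible loop"
            if target >= len(prog):
                return f"insn {pc}: jump past end of program"
        elif insn.op == "load":
            # Every packet read must fall inside the assumed packet length.
            if not 0 <= insn.arg < pkt_len:
                return f"insn {pc}: out-of-bounds packet access"
        elif insn.op != "exit":
            return f"insn {pc}: unknown opcode"
    return None


print(toy_verify([Insn("load", 12), Insn("jmp", 0), Insn("exit")], 64))
print(toy_verify([Insn("jmp", -1), Insn("exit")], 64))
print(toy_verify([Insn("load", 64), Insn("exit")], 64))
```

The first program is accepted; the second is rejected for its backward
jump and the third for reading past the end of the packet.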

That to me is real power, and something we should pursue.

This doesn't even get into the offloading and other benefits that are
possible.  I know you can't see how offloading is possible, but I hope
after some further discussion you can see how that might work.

Thanks.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Harald Welte

Hi Daniel,

On Mon, Feb 19, 2018 at 01:03:17PM +0100, Daniel Borkmann wrote:
> Hi Harald,
> 
> On 02/17/2018 01:11 PM, Harald Welte wrote:
> [...]
> >> As rule translation can potentially become very complex, this is performed
> >> entirely in user space. In order to ease deployment, request_module() code
> >> is extended to allow user mode helpers to be invoked. Idea is that user mode
> >> helpers are built as part of the kernel build and installed as traditional
> >> kernel modules with .ko file extension into distro specified location,
> >> such that from a distribution point of view, they are no different than
> >> regular kernel modules. 
> > 
> > That just blew my mind, sorry :)  This goes much beyond
> > netfilter/iptables, and adds some quite significant new piece of
> > kernel/userspace infrastructure.  To me, my apologies, it just sounds
> > like a quite strange hack.  But then, I may lack the vision of how this
> > might be useful in other contexts.
> 
> Thought was that it would be more suitable to push all the complexity of
> such translation into user space [...]

Sure, you have no complaints from my side about that goal.  I'm just not
sure if turning the kernel module loader into a new mechanism to start
userspace processes is the right way to get there.  I guess that's a
question that the people
involved with core kernel code and module loader have to answer.  To me
it seems like a very long detour away from the actual topic (packet
filtering).

> Given normal user mode helpers make this rather painful since they
> need to be shipped as extra package by the various distros, the idea
> was that the module loader back end could treat umh similarly as
> kernel modules and hook them in through request_module() approach
> while still operating out of user space. In any case, I could imagine
> this approach might be interesting and useful in general also for
> other subsystems requiring umh in one way or another.

I completely agree this approach has some logic to it.  I just think the
approach taken is *very* different from what has been traditionally done
in the Linux world.  All sorts of userspace programs to configure kernel
features (iptables being one of them, iproute2, etc.) have always been
distributed as separate/independent application programs, which are
packaged separately, etc.

Making the kernel source tree build such userspace utilities and
executing them in a new fashion via the kernel module loaders are to me
two quite large conceptual changes on how "Linux works", and I believe
you will have to "sell" this idea to many people outside the kernel
networking community, i.e. core kernel developers, people who do
packaging, etc.

I'm not saying I'm fundamentally opposed to it.  Will be curious to see
how the wider kernel community thinks of that architecture.

> Right, having a custom iptables, libiptc or LD_PRELOAD approach would work
> as well of course, but it still wouldn't address applications that have
> their own custom libs programmed against iptables uapi directly or those
> that reused a builtin or modified libiptc directly in their application.

How many of those wide-spread applications are you aware of?  The two
projects you have pointed out (docker and kubernetes) don't. As the
assumption that many such tools would need to be supported drives a lot
of the design decisions, I would argue one needs a solid empirical basis.

Also, the LD_PRELOAD wrapper *would* work with all those programs.  Only
the iptables command line replacement wouldn't catch those.

> Such requests could only be covered transparently by having a small shim
> layer in kernel and it also wouldn't require any extra packages from distro
> side.

What is wrong with extra packages in distributions?  Distributions also
will have to update the kernel to include your new code, so they could
at the same time use a new iptables (or $whatever) package.  This is
true for virtually all new kernel features.  Your userland needs to go
along with it, if it wants to use those new features.

> > Some of those can be implemented easily in BPF (like recomputing the
> > checksum or the like).   Some others I would find much more difficult -
> > particularly if you want to off-load it to the NIC.  They require access
> > to state that only the kernel has (like 'cgroup' or 'owner' matching).
> 
> Yeah, when it comes to offloading, the latter two examples are heavily tied
> to upper layers of the (local) stack, so for cases like those it wouldn't
> make much sense, but e.g. matches, mangling or forwarding based on packet
> data are obvious candidates that can already be offloaded today in a
> flexible and programmable manner all with existing BPF infra, so for those
> it could definitely be highly interesting to make use of it.

While I believe you that there are many ways one can offload things
flexibly with eBPF, I still have a hard time understanding how you want
to merge this with the existing well-defined notion of when exactly a
given chain

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Daniel Borkmann

Hi Harald,

On 02/17/2018 01:11 PM, Harald Welte wrote:
[...]
>> As rule translation can potentially become very complex, this is performed
>> entirely in user space. In order to ease deployment, request_module() code
>> is extended to allow user mode helpers to be invoked. Idea is that user mode
>> helpers are built as part of the kernel build and installed as traditional
>> kernel modules with .ko file extension into distro specified location,
>> such that from a distribution point of view, they are no different than
>> regular kernel modules. 
> 
> That just blew my mind, sorry :)  This goes much beyond
> netfilter/iptables, and adds some quite significant new piece of
> kernel/userspace infrastructure.  To me, my apologies, it just sounds
> like a quite strange hack.  But then, I may lack the vision of how this
> might be useful in other contexts.

Thought was that it would be more suitable to push all the complexity of
such translation into user space which brings couple of additional advantages
as well: the translation can become very complex and thus it would contain
all of it behind syscall boundary where natural path of loading programs
would go via verifier. Given the tool would reside in user space, it would
also allow to ease development and testing can happen w/o recompiling the
kernel. It would allow for all the clang sanitizers to run there and for
having a comprehensive test suite to verify and dry test translations against
traffic test patterns (e.g. bpf infra would provide possibilities on this
w/o complex setup). Given normal user mode helpers make this rather painful
since they need to be shipped as extra package by the various distros, the
idea was that the module loader back end could treat umh similarly as kernel
modules and hook them in through request_module() approach while still
operating out of user space. In any case, I could imagine this approach might
be interesting and useful in general also for other subsystems requiring
umh in one way or another.

> I'm trying to understand why exactly one would
> * use a 18 year old iptables userspace program with its equally old
>   setsockopt based interface between kernel and userspace
> * insert an entire table with many chains of rules into the kernel
> * re-eject that ruleset into another userspace program which then
>   compiles it into an eBPF program
> * insert that back into the kernel
> 
> To me, this looks like some kind of legacy backwards compatibility
> mechanism that one would find in proprietary operating systems, but not
> in Linux.  iptables, libiptc etc. are all free software.  The source
> code can be edited, and you could just as well have a new version of
> iptables and/or libiptc which would pass the ruleset in userspace to
> your compiler, which would then insert the resulting eBPF program.
> 
> You could even have a LD_PRELOAD wrapper doing the same.  That one
> would even work with direct users of the iptables setsockopt interface.
> 
> Why add quite comprehensive kernel infrastructure?  What's the motivation
> here?

Right, having a custom iptables, libiptc or LD_PRELOAD approach would work
as well of course, but it still wouldn't address applications that have
their own custom libs programmed against iptables uapi directly or those
that reused a builtin or modified libiptc directly in their application.
Such requests could only be covered transparently by having a small shim
layer in kernel and it also wouldn't require any extra packages from distro
side.

[...]
>> In the implemented proof of concept we show that simple /32 src/dst IPs
>> are translated in such manner. 
> 
> Of course this is the first that one starts with.  However, as we all
> know, iptables was never very good or efficient about 5-tuple matching.
> If you want a fast implementation of this, you don't use iptables which
> does linear list iteration.  The reason/rationale/use-case of iptables
> is its many (I believe more than 100 now?) extensions both on the area
> of matches and targets.
> 
> Some of those can be implemented easily in BPF (like recomputing the
> checksum or the like).   Some others I would find much more difficult -
> particularly if you want to off-load it to the NIC.  They require access
> to state that only the kernel has (like 'cgroup' or 'owner' matching).

Yeah, when it comes to offloading, the latter two examples are heavily tied
to upper layers of the (local) stack, so for cases like those it wouldn't
make much sense, but e.g. matches, mangling or forwarding based on packet
data are obvious candidates that can already be offloaded today in a
flexible and programmable manner all with existing BPF infra, so for those
it could definitely be highly interesting to make use of it.

>> In the below example, we show that dumping, loading and offloading of
>> one or multiple simple rules work, we show the bpftool XDP dump of the
>> generated BPF instruction sequence as well as a simple functional ping
>> test to enforce poli

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-18 Thread Florian Westphal

Daniel Borkmann  wrote:
> As rule translation can potentially become very complex, this is performed
> entirely in user space. In order to ease deployment, request_module() code
> is extended to allow user mode helpers to be invoked. Idea is that user mode
> helpers are built as part of the kernel build and installed as traditional
> kernel modules with .ko file extension into distro specified location,
> such that from a distribution point of view, they are no different than
> regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) binaries via:
> 
>   request_module("foo") ->
>     call_umh("modprobe foo") ->
>       sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>         call_umh(struct file)
> 
> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root)

Unrelated:  AFAIU this would allow to e.g. move the compat32 handlers
(which are very ugly/error prone) off to userspace?

compat_syscall -> umh_32_64_xlate -> syscall() ?

[ feel free to move this to different thread, only mentioning this
  so I won't forget ]

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal

Harald Welte  wrote:
> I believe _if_ one wants to use the approach of "hiding" eBPF behind
> iptables, then either
[..]
> b) you must introduce new 'tables', like an 'xdp' table which then has
>the notion of processing very early in processing, way before the
>normal filter table INPUT processing happens.

In nftables, the netdev ingress hook location could be used for this,
but right, iptables has no equivalent.

netdev ingress is interesting from an hw-offload point of view,
unlike all other netfilter hooks it's tied to a specific network interface
rather than owned by the network namespace.

A rule like (yes i am making this up)
limit 1 byte/s

cannot be offloaded because it affects all packets going through
the system, i.e. you'd need to share state among all nics which i think
won't work :-)

Same goes for any other match/target that somehow contains (global)
state and was added to the 'classic' iptables hook points.
(exception: rule restricts interface via '-i foo').
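
The shared-state problem can be sketched with a toy model in Python (not
kernel or NIC code; WindowLimiter and accepted() are names invented for this
illustration).  A limit of one packet per second is enforced once with a
single system-wide counter, and once with private counters per NIC, as a
naive offload would do:

```python
class WindowLimiter:
    """Toy limiter: allows at most `rate` packets per one-second window."""

    def __init__(self, rate):
        self.rate = rate
        self.window = None
        self.used = 0

    def allow(self, now):
        window = int(now)            # index of the current one-second window
        if window != self.window:    # new window: reset the budget
            self.window = window
            self.used = 0
        if self.used < self.rate:
            self.used += 1
            return True
        return False


def accepted(packets, limiter_for):
    """Count accepted packets; limiter_for(nic) picks the limiter to consult."""
    return sum(1 for now, nic in packets if limiter_for(nic).allow(now))


# Four packets within the same second, spread over two interfaces.
packets = [(0.1, "eth0"), (0.2, "eth1"), (0.3, "eth0"), (0.4, "eth1")]

# Software path: one limiter shared by every interface.
shared = WindowLimiter(rate=1)
global_count = accepted(packets, lambda nic: shared)

# Naive offload: every NIC keeps its own private state.
per_nic = {"eth0": WindowLimiter(rate=1), "eth1": WindowLimiter(rate=1)}
offload_count = accepted(packets, lambda nic: per_nic[nic])

print(global_count, offload_count)
```

With private per-NIC state the system as a whole lets through more packets
than the rule allows, unless the rule is restricted to a single interface.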

Note well: "offloaded != ebpf" in this case.

I see no reasons why ebpf cannot be used in either iptables or
nftables.  How to get there is obviously a different beast.

For iptables, I think we should put it in maintenance mode and
focus on nftables, for many reasons outlined in other replies.

And how to best make use of ebpf+nftables

In ideal world, nftables would have used (e)bpf from the start.
But, well, it's not an ideal world (iirc nft origins are just a bit
too old).

That doesn't mean that we can't leverage ebpf from nftables.
It's just a question of where it makes sense and where it doesn't,
f.e. i see no reason to replace C code with ebpf just 'because you can'.

Speedup?  Good argument.
Feature enhancements that could use ebpf programs? Another good
argument.

I guess there are a lot more.

So I'd like to second Harald's question.

What is the main goal?

For nftables, I believe most important ones are:
- make kernel keeper/owner of all rules
- allow userspace to learn of rule addition/deletion
- provide fast matching (no linear evaluation of rules,
  native sets with jump and verdict maps)
- provide a single tool instead of ip/ip6/arp/ebtables
- unified ipv4/ipv6 matching
- backwards compat and/or translation infrastructure
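
As a rough illustration of the "fast matching" item, here is a toy Python
comparison (not nftables code; the rule list and function names are invented
for this sketch).  Linear evaluation walks the rules until one matches, while
a verdict map does one keyed lookup regardless of how many entries exist:

```python
# Each rule is (source address, verdict); a toy stand-in for a ruleset.
linear_rules = [
    ("192.168.0.24", "accept"),
    ("192.168.0.25", "drop"),
    ("192.168.0.26", "accept"),
]


def linear_verdict(saddr, rules, default="continue"):
    """Walk the rules in order; also report how many rules were checked."""
    checked = 0
    for addr, verdict in rules:
        checked += 1
        if saddr == addr:
            return verdict, checked
    return default, checked


# Verdict map: the same information keyed by address.
verdict_map = dict(linear_rules)


def map_verdict(saddr, vmap, default="continue"):
    """Single lookup, independent of the number of entries."""
    return vmap.get(saddr, default)


print(linear_verdict("192.168.0.26", linear_rules))
print(map_verdict("192.168.0.26", verdict_map))
```

The cost of the linear walk grows with the position of the matching rule; the
map lookup does not.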

But once these are reached, we will hopefully have more:
- offloading (hardware)
- speedup via JIT compilation
- feature enhancements such as matching arbitrary packet
  contents

I suspect you see that ebpf might be a fit and/or help us with
all of these things.

So, once I understand what your goals are I might be better able
to see how nftables could fit into the picture, as you can see
I did a lot of guesswork :-)

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal

Florian Westphal  wrote:
> David Miller  wrote:
> > From: Florian Westphal 
> > Date: Fri, 16 Feb 2018 17:14:08 +0100
> > 
> > > Any particular reason why translating iptables rather than nftables
> > > (it should be possible to monitor the nftables changes that are
> > >  announced by kernel and act on those)?
> > 
> > As Daniel said, iptables is by far the most deployed of the two
> > technologies.  Therefore it provides the largest environment for
> > testing and coverage.
> 
> Right, but the approach of hooking old blob format comes with
> lots of limitations that were meant to be resolved with a netlink based
> interface which places kernel in a position to mediate all transactions
> to the rule database (which isn't fixable with old setsockopt format).
> 
> As all programs call iptables(-restore) or variants translation can
> be done in userspace to nftables so api spoken is nfnetlink.
> Such a translator already exists and can handle some cases already:
> 
> nft flush ruleset
> nft list ruleset | wc -l
> 0
> xtables-compat-multi iptables -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
> xtables-compat-multi iptables -A REJECT_LOG -i eth0 -p tcp --tcp-flags SYN,ACK SYN --dport 22:80 -m limit --limit 1/sec -j LOG --log-prefix "RejectTCPConnectReq"

to be fair, for these two I had to use
$(xtables-compat-multi iptables-translate -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT)

Reason is that the 'iptables-translate' part nowadays has way more
translations available (nft gained many features since the
iptables-compat layer was added).

If given appropriate priority however it should be pretty
trivial to make the 'translate' descriptions available in
the 'direct' version, we already have function in libnftables
to execute/run a command directly from a buffer so this would
not even need fork/execve overhead (although I don't think
it's a big concern).

> (f.e. nftables misses some selinux matches/targets for netlabel so we obviously
> can't translate this, same for ipsec sa/policy matching -- but this isn't
> impossible to resolve).

I am working on some poc code for the sa/policy thing now.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal

David Miller  wrote:
> From: Florian Westphal 
> Date: Fri, 16 Feb 2018 17:14:08 +0100
> 
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> >  announced by kernel and act on those)?
> 
> As Daniel said, iptables is by far the most deployed of the two
> technologies.  Therefore it provides the largest environment for
> testing and coverage.

Right, but the approach of hooking old blob format comes with
lots of limitations that were meant to be resolved with a netlink based
interface which places kernel in a position to mediate all transactions
to the rule database (which isn't fixable with old setsockopt format).

As all programs call iptables(-restore) or variants translation can
be done in userspace to nftables so api spoken is nfnetlink.
Such a translator already exists and can handle some cases already:

nft flush ruleset
nft list ruleset | wc -l
0
xtables-compat-multi iptables -A INPUT -s 192.168.0.24 -j ACCEPT
xtables-compat-multi iptables -A INPUT -s 192.168.0.0/16 -p tcp --dport 22 -j ACCEPT
xtables-compat-multi iptables -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
xtables-compat-multi iptables -A INPUT -p icmp -j ACCEPT
xtables-compat-multi iptables -N REJECT_LOG
xtables-compat-multi iptables -A REJECT_LOG -i eth0 -p tcp --tcp-flags SYN,ACK SYN --dport 22:80 -m limit --limit 1/sec -j LOG --log-prefix "RejectTCPConnectReq"
xtables-compat-multi iptables -A REJECT_LOG -j DROP
xtables-compat-multi iptables -A INPUT -j REJECT_LOG

nft list ruleset
table ip filter {
        chain INPUT {
                type filter hook input priority 0; policy accept;
                ip saddr 192.168.0.24 counter packets 0 bytes 0 accept
                ip saddr 192.168.0.0/16 tcp dport 22 counter accept
                iifname "eth0" ct state related,established counter accept
                ip protocol icmp counter packets 0 bytes 0 accept
                counter packets 0 bytes 0 jump REJECT_LOG
        }

        chain FORWARD {
                type filter hook forward priority 0; policy accept;
        }

        chain OUTPUT {
                type filter hook output priority 0; policy accept;
        }

        chain REJECT_LOG {
                iifname "eth0" tcp dport 22-80 tcp flags & (syn | ack) == syn limit rate 1/second burst 5 packets counter packets 0 bytes 0 log prefix "RejectTCPConnectReq"
                counter packets 0 bytes 0 drop
        }
}

and, while 'iptables' rules were added, nft monitor in different terminal:
nft monitor
add table ip filter
add chain ip filter INPUT { type filter hook input priority 0; policy accept; }
add chain ip filter FORWARD { type filter hook forward priority 0; policy accept; }
add chain ip filter OUTPUT { type filter hook output priority 0; policy accept; }
add rule ip filter INPUT ip saddr 192.168.0.24 counter packets 0 bytes 0 accept
# new generation 9893 by process 7471 (xtables-compat-)
add rule ip filter INPUT ip saddr 192.168.0.0/16 tcp dport 22 counter accept
# new generation 9894 by process 7504 (xtables-compat-)
add rule ip filter INPUT iifname "eth0" ct state related,established counter accept
# new generation 9895 by process 7528 (xtables-compat-)
add rule ip filter INPUT ip protocol icmp counter packets 0 bytes 0 accept
# new generation 9896 by process 7542 (xtables-compat-)
add chain ip filter REJECT_LOG
# new generation 9897 by process 7595 (xtables-compat-)
add rule ip filter REJECT_LOG iifname "eth0" tcp dport 22-80 tcp flags & (syn | ack) == syn limit rate 1/second burst 5 packets counter packets 0 bytes 0 log prefix "RejectTCPConnectReq"
# new generation 9898 by process 7639 (xtables-compat-)
add rule ip filter REJECT_LOG counter packets 0 bytes 0 drop
# new generation 9899 by process 7657 (xtables-compat-)
add rule ip filter INPUT counter packets 0 bytes 0 jump REJECT_LOG
# new generation 9900 by process 7663 (xtables-compat-)

Now, does this work in all cases?

Unfortunately not -- this is still work-in-progress, so I would
not rm /sbin/iptables and replace it with a link to xtables-compat-multi
just yet.

(f.e. nftables misses some selinux matches/targets for netlabel so we obviously
can't translate this, same for ipsec sa/policy matching -- but this isn't
impossible to resolve).

Hopefully this does show that at least some commonly used features work
and that we've come a long way to make seamless nftables transition happen.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal

Daniel Borkmann  wrote:
> Hi Florian,
> 
> On 02/16/2018 05:14 PM, Florian Westphal wrote:
> > Florian Westphal  wrote:
> >> Daniel Borkmann  wrote:
> >> Several questions spinning at the moment, I will probably come up with
> >> more:
> > 
> > ... and here there are some more ...
> > 
> > One of the many pain points of xtables design is the assumption of 'used
> > only by sysadmin'.
> > 
> > This has not been true for a very long time, so by now iptables has
> > this userspace lock (yes, its fugly workaround) to serialize concurrent
> > iptables invocations in userspace.
> > 
> > AFAIU the translate-in-userspace design now brings back the old problem
> > of different tools overwriting each others iptables rules.
> 
> Right, so the behavior would need to be adapted to be exactly the same,
> given all the requests go into kernel space first via the usual uapis,
> I don't think there would be anything in the way of keeping that as is.

Uff.  This isn't solvable.  At least that's what I tried to say here.
This is a limitation of the xtables setsockopt interface design.

If $docker (or anything else) adds a new rule using plain iptables other
daemons are not aware of it.

If someone deletes a rule added by $software it won't learn that either.

The "solutions" in place now (periodic reloads/'is my rule still in
place' etc.) are not desirable long-term.
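
For contrast, a toy Python model of an interface that announces changes (this
is not the nfnetlink API; RuleStore and its methods are invented for this
sketch).  A daemon that subscribes learns about every add and delete, no
matter who made the change, and needs no periodic re-check:

```python
class RuleStore:
    """Toy stand-in for a kernel-owned rule database with change events."""

    def __init__(self):
        self.rules = []
        self.listeners = []

    def subscribe(self, callback):
        """Register callback(op, rule, who) to be called on every change."""
        self.listeners.append(callback)

    def add(self, rule, who):
        self.rules.append(rule)
        self._notify("add", rule, who)

    def delete(self, rule, who):
        self.rules.remove(rule)
        self._notify("delete", rule, who)

    def _notify(self, op, rule, who):
        for callback in self.listeners:
            callback(op, rule, who)


store = RuleStore()
events = []  # what a monitoring daemon would see
store.subscribe(lambda op, rule, who: events.append((op, rule, who)))

store.add("tcp dport 22 accept", who="daemon-a")
store.delete("tcp dport 22 accept", who="admin")

print(events)
```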

You'll also need 4 decoders for arp/ip/ip6/ebtables plus translations
for all matches and targets xtables currently has. (almost 100 i would
guess from quick glance).

Some of the more crazy ones also have external user visible interfaces
outside setsockopt (proc files, ipset).

> > One of the nftables advantages is that (since rule representation in
> > kernel is black-box from userspace point of view) is that the kernel
> > can announce add/delete of rules or elements from nftables sets.
> > 
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> >  announced by kernel and act on those)?
> 
> Yeah, correct, this should be possible as well. We started out with the
> iptables part in the demo as the majority of bigger infrastructure projects
> all still rely heavily on it (e.g. docker, k8s to just name two big ones).

Yes, which is why we have translation tools in place.

Just for the fun of it I tried to delete ip/ip6tables binaries on my
fedora27 laptop and replaced them with symlinks to
'xtables-compat-multi'.

Aside from two issues (SELinux denying 'iptables' to use netlink) and
one translation issue (-m rpfilter, which can be translated in current
upstream version) this works out of the box, the translator uses
nftables api to kernel (so kernel doesn't even know which program is
talking...), 'nft monitor' displays the rules being added, and
'nft list ruleset' shows the default firewalld ruleset.

Obviously there are a few limitations, for instance ip6tables-save will
stop working once you add nft-based rules that use features that cannot
be expressed in xtables syntax (it will throw an error message similar
to 'you are using nftables features not available in xtables, please use
nft'), for instance verdict maps, sets and the like.

> Usually they have their requests to iptables baked into their code directly
> which probably won't change any time soon, so thought was that they could
> benefit initially from it once there would be sufficient coverage.

See above, the translator covers most basic use cases nowadays.
The more extreme cases are not covered because we were reluctant to
provide equivalent in nftables (-m time comes to mind which was always a
PITA because kernel has no notion of timezone or DST transitions,
leading to 'magic' mismatches when timezone changes...).
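
A toy Python model of that kind of mismatch (not how any real match is
implemented; utc_window() and matches() are invented here, and the fixed
UTC+1/UTC+2 offsets are just an example).  A local-time window is converted
to UTC once, at install time, and then silently drifts when the local offset
changes:

```python
from datetime import datetime, timezone


def utc_window(local_start_hour, local_end_hour, utc_offset_hours):
    """Convert a local hour window to UTC hours using a fixed offset."""
    return ((local_start_hour - utc_offset_hours) % 24,
            (local_end_hour - utc_offset_hours) % 24)


def matches(utc_now, window):
    """True if the UTC hour lies in [start, end); no midnight wraparound."""
    start, end = window
    return start <= utc_now.hour < end


# Rule installed in winter (UTC+1): "match 09:00-17:00 local time".
window = utc_window(9, 17, utc_offset_hours=1)

# Winter, 09:30 local is 08:30 UTC: matches as intended.
winter = datetime(2018, 1, 15, 8, 30, tzinfo=timezone.utc)

# Summer (UTC+2), 09:30 local is 07:30 UTC: no longer matches.
summer = datetime(2018, 7, 15, 7, 30, tzinfo=timezone.utc)

print(window, matches(winter, window), matches(summer, window))
```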

I could elaborate on more problem cases but none of them are too
important I think.

If you'd like to have more ebpf users in the kernel, then there is at
least one use case where ebpf could be very attractive for nftables
(matching dynamic headers and the like).  This would be a new
feature and would need changes on nftables userspace side
as well (we don't have syntax/grammar to represent this in either
nft or iptables).

In most basic form, it would be nftables replacement for '-m string'
(and perhaps also -m bpf to some degree, depends on how it would be
 realized).

We can discuss more if there is interest, but I think it
would be more suitable for conference/face to face discussion.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Harald Welte

Hi Daniel,

On Fri, Feb 16, 2018 at 02:40:19PM +0100, Daniel Borkmann wrote:
> This is a very rough and early proof of concept that implements bpfilter.
> The basic idea of bpfilter is that it can process iptables queries and
> translate them in user space into BPF programs which can then get attached
> at various locations. 

Interesting approach.  My first question would be what the goal of all of this
is.  For sure, one can implement many different things, but what is the use
case, and why do it this way?

I see several possible areas of contention:

1) If you aim for a non-feature-complete support of iptables rules, it
   will create confusion to the users.  When users use "iptables", they have
   assumptions on what it will do and how it will behave.  One can of course
   replace / refactor the internal implementation, if the resulting behavior
   is identical.  And that means rules are executed at the same hooks in the
   stack, with functionally identical matches and targets, provide the same
   counter semantics, etc.  But if the behavior is different, and/or the
   provided functionality is different, then why "hide" this new
   filtering technology behind iptables, rather than its own command
   line tool?  Such an alternative tool could share the same command
   line syntax as iptables, or even provide a converter/wrapper, but
   given that it would not be called "iptables" people will implicitly
   have different assumptions about it.

2) Why try to provide compatibility to iptables, when at the same time
   many people have already migrated to (or are in the process of
   migrating to) nftables?  By using iptables semantics, structures,
   architecture, you risk perpetuating the design mistakes we made in
   iptables some 18 years ago for another decade or more.  From my POV,
   if one was to do eBPF optimized rule execution, it should be based on
   nftables rather than iptables.  This way you avoid the many
   architectural problems, such as
   * no incremental rule changes but only atomic swap of an entire table
     with all its chains
   * no common/shared rulesets for IPv4 + IPv6, which is very clumsy and
     often worked around with ugly shellscript wrappers in userspace
     which then call both iptables and ip6tables to add a rule to both
     rulesets.
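
The first point can be illustrated with a toy Python model (not the real
setsockopt interface; BlobTable and append_rule() are invented for this
sketch).  When a table can only be dumped and replaced as a whole, two tools
that each "add one rule" can overwrite each other's change:

```python
class BlobTable:
    """Toy model: the table can only be read or replaced as a whole."""

    def __init__(self):
        self.rules = []

    def get(self):
        return list(self.rules)      # snapshot, like dumping the table

    def replace(self, rules):
        self.rules = list(rules)     # whole-table swap


def append_rule(snapshot, rule):
    """What a tool does locally before writing the table back."""
    return snapshot + [rule]


table = BlobTable()

# Two tools take their snapshots before either one writes back.
snap_a = table.get()
snap_b = table.get()

table.replace(append_rule(snap_a, "rule from tool A"))
table.replace(append_rule(snap_b, "rule from tool B"))

print(table.rules)
```

Tool A's rule is gone after tool B's swap, although neither tool asked for a
deletion.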

> The user space iptables binary issuing rule addition or dumps was
> left as-is, thus at some point any binaries against iptables uapi kernel
> interface could transparently be supported in such manner in long term.

See my comments above:  In the netfilter community, we have known for at
least a decade about the many problems of the old iptables userspace
interface.  For many years, a much better replacement has been designed
as part of nftables.

> As rule translation can potentially become very complex, this is performed
> entirely in user space. In order to ease deployment, request_module() code
> is extended to allow user mode helpers to be invoked. Idea is that user mode
> helpers are built as part of the kernel build and installed as traditional
> kernel modules with .ko file extension into distro specified location,
> such that from a distribution point of view, they are no different than
> regular kernel modules. 

That just blew my mind, sorry :)  This goes much beyond
netfilter/iptables, and adds some quite significant new piece of
kernel/userspace infrastructure.  To me, my apologies, it just sounds
like a quite strange hack.  But then, I may lack the vision of how this
might be useful in other contexts.

I'm trying to understand why exactly one would
* use a 18 year old iptables userspace program with its equally old
  setsockopt based interface between kernel and userspace
* insert an entire table with many chains of rules into the kernel
* re-eject that ruleset into another userspace program which then
  compiles it into an eBPF program
* insert that back into the kernel

To me, this looks like some kind of legacy backwards compatibility
mechanism that one would find in proprietary operating systems, but not
in Linux.  iptables, libiptc etc. are all free software.  The source
code can be edited, and you could just as well have a new version of
iptables and/or libiptc which would pass the ruleset in userspace to
your compiler, which would then insert the resulting eBPF program.

You could even have a LD_PRELOAD wrapper doing the same.  That one
would even work with direct users of the iptables setsockopt interface.

Why add quite comprehensive kernel infrastructure?  What's the motivation
here?

> Thus, allow request_module() logic to load such
> user mode helper (umh) binaries via:
> 
>   request_module("foo") ->
>     call_umh("modprobe foo") ->
>       sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>         call_umh(struct file)
> 
> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack sur

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Harald Welte

Hi Daniel,

On Fri, Feb 16, 2018 at 09:44:01PM +0100, Daniel Borkmann wrote:
> We started out with the
> iptables part in the demo as the majority of bigger infrastructure projects
> all still rely heavily on it (e.g. docker, k8s to just name two big ones).

docker is exec'ing the iptables command line program.  So one could simply
offer a syntactically compatible userspace replacement that does the compilation
in userspace and avoid the iptables->libiptc->setsockopt->userspace roundtrip
and the associated changes to the kernel module loader you introduced.

kubernetes is using iptables-restore, which is part of iptables and
again has the same syntax.  However, it avoids the per-rule fork+exec
overhead, which is why the netfilter project has been recommending it to
be used in such situations.

Do you have a list of known projects that use the legacy sockopt-based
iptables uapi directly, without using code from the iptables.git
codebase (e.g. libiptc, iptables or iptables-restore)?  IMHO only
those projects would benefit from the approach you have taken vs. an
approach that simply offers a compatible commandline syntax.

> Usually they have their requests to iptables baked into their code directly
> which probably won't change any time soon, so thought was that they could
> benefit initially from it once there would be sufficient coverage.

If the binary offers the same syntax (it could even be a fork/version
of the iptables codebase, only using the parsing without the existing
backend generating the rules), the same goal could be achieved.

The above of course assumes that you have a 100% functional replacement
(for 100% of the features that your use cases use) underneath the
"iptables command syntax" compatibility.  But you need that in both
cases, whether you use the existing userspace api or not.

Regards,
Harald
-- 
- Harald Welte  http://laforge.gnumonks.org/

"Privacy in residential applications is a desirable marketing option."
  (ETSI EN 300 175-7 Ch. A6)

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Harald Welte

Hi David,

On Fri, Feb 16, 2018 at 05:33:54PM -0500, David Miller wrote:
> From: Florian Westphal 
> 
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> >  announced by kernel and act on those)?
> 
> As Daniel said, iptables is by far the most deployed of the two
> technologies.  Therefore it provides the largest environment for
> testing and coverage.

As I outlined earlier, this way you are perpetuating the architectural
mistakes and constraints that were created ~ 18 years ago without any
benefit from the lessons learned ever since.  In netfilter, we already
wanted to replace it as early as 2006 (AFAIR) with nfnetlink based
pkttables (which never materialized).

I would strongly suggest to focus on nftables (or even some other way of
configuration / userspace interaction) to ensure that the iptables
userspace interface can at some point be phased out eventually.  Like we
did with ipchains before, and before that with ipfwadm.

By making a new implementation dependent on the oldest interface you are
perpetuating it.  Sure, one can go that way, but I would suggest this to
be a *very* carefully weighed decision after a detailed
analysis/discussion.

-- 
- Harald Welte  http://laforge.gnumonks.org/

"Privacy in residential applications is a desirable marketing option."
  (ETSI EN 300 175-7 Ch. A6)

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-16 Thread David Miller

From: Florian Westphal 
Date: Fri, 16 Feb 2018 17:14:08 +0100

> Any particular reason why translating iptables rather than nftables
> (it should be possible to monitor the nftables changes that are
>  announced by kernel and act on those)?

As Daniel said, iptables is by far the most deployed of the two
technologies.  Therefore it provides the largest environment for
testing and coverage.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-16 Thread David Miller

From: Florian Westphal 
Date: Fri, 16 Feb 2018 15:57:27 +0100

> 4. Do you plan to reimplement connection tracking in userspace?
> If no, how will the bpf program interact with it?

The natural way to handle this, as with anything BPF related, is with
appropriate BPF helpers which would be added for this purpose.

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-16 Thread Daniel Borkmann

Hi Florian,

On 02/16/2018 05:14 PM, Florian Westphal wrote:
> Florian Westphal  wrote:
>> Daniel Borkmann  wrote:
>> Several questions spinning at the moment, I will probably come up with
>> more:
> 
> ... and here there are some more ...
> 
> One of the many pain points of xtables design is the assumption of 'used
> only by sysadmin'.
> 
> This has not been true for a very long time, so by now iptables has
> this userspace lock (yes, its fugly workaround) to serialize concurrent
> iptables invocations in userspace.
> 
> AFAIU the translate-in-userspace design now brings back the old problem
> of different tools overwriting each others iptables rules.

Right, so the behavior would need to be adapted to be exactly the same,
given all the requests go into kernel space first via the usual uapis,
I don't think there would be anything in the way of keeping that as is.

> Another question -- am i correct in that each rule manipulation would
> incur a 'recompilation'?  Or are there different mini programs chained
> together?

Right now in the PoC yes, basically it regenerates the program on the fly
in gen.c when walking the struct bpfilter_ipt_ip's and appends the entries
to the program, but it doesn't have to be that way. There are multiple
options to allow for a partial code generation, e.g. via chaining tail
call arrays or directly via BPF to BPF calls eventually, there would be
few changes on BPF side needed, but it can be done; there could additionally
be various optimizations passes during code generation phase performed
while keeping given constraints in order to speed up getting to a verdict.
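
A toy Python sketch of the trade-off described above (plain Python, not BPF;
the class names are invented, and real tail calls or BPF to BPF calls work
differently in detail).  It compares regenerating the whole program on every
change with compiling only the new rule and chaining it to the existing ones:

```python
def compile_rule(rule):
    """Turn one (saddr, verdict) rule into a tiny 'program'."""
    saddr, verdict = rule
    return lambda pkt: verdict if pkt["saddr"] == saddr else None


class Monolithic:
    """Regenerates the program from all rules on every change."""

    def __init__(self):
        self.rules = []
        self.stages = []
        self.compiled = 0            # rule compilations performed so far

    def add(self, rule):
        self.rules.append(rule)
        self.stages = [compile_rule(r) for r in self.rules]
        self.compiled += len(self.rules)


class Chained:
    """Compiles only the new rule and appends it to the chain."""

    def __init__(self):
        self.stages = []
        self.compiled = 0

    def add(self, rule):
        self.stages.append(compile_rule(rule))
        self.compiled += 1


def run(stages, pkt, default="accept"):
    """Evaluate stages in order; the first verdict wins."""
    for stage in stages:
        verdict = stage(pkt)
        if verdict is not None:
            return verdict
    return default


rules = [("10.0.0.1", "drop"), ("10.0.0.2", "drop"), ("10.0.0.3", "accept")]
mono, chain = Monolithic(), Chained()
for r in rules:
    mono.add(r)
    chain.add(r)

print(mono.compiled, chain.compiled)
```

Both variants reach the same verdicts; only the amount of work per rule
change differs.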

> One of the nftables advantages is that (since rule representation in
> kernel is black-box from userspace point of view) is that the kernel
> can announce add/delete of rules or elements from nftables sets.
> 
> Any particular reason why translating iptables rather than nftables
> (it should be possible to monitor the nftables changes that are
>  announced by kernel and act on those)?

Yeah, correct, this should be possible as well. We started out with the
iptables part in the demo as the majority of bigger infrastructure projects
all still rely heavily on it (e.g. docker, k8s to just name two big ones).
Usually they have their requests to iptables baked into their code directly
which probably won't change any time soon, so thought was that they could
benefit initially from it once there would be sufficient coverage.

Thanks,
Daniel

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-16 Thread Daniel Borkmann

Hi Florian,

thanks for your feedback! More inline:

On 02/16/2018 03:57 PM, Florian Westphal wrote:
> Daniel Borkmann  wrote:
>> This is a very rough and early proof of concept that implements bpfilter.
> 
> [..]
> 
>> Also, as a benefit from such design, we get BPF JIT compilation on x86_64,
>> arm64, ppc64, sparc64, mips64, s390x and arm32, but also rule offloading
>> into HW for free for Netronome NFP SmartNICs that are already capable of
>> offloading BPF since we can reuse all existing BPF infrastructure as the
>> back end. The user space iptables binary issuing rule addition or dumps was
>> left as-is, thus at some point any binaries against iptables uapi kernel
>> interface could transparently be supported in such manner in long term.
>>
>> As rule translation can potentially become very complex, this is performed
>> entirely in user space. In order to ease deployment, request_module() code
>> is extended to allow user mode helpers to be invoked. Idea is that user mode
>> helpers are built as part of the kernel build and installed as traditional
>> kernel modules with .ko file extension into distro specified location,
>> such that from a distribution point of view, they are no different than
>> regular kernel modules. Thus, allow request_module() logic to load such
>> user mode helper (umh) binaries via:
>>
>>   request_module("foo") ->
>>     call_umh("modprobe foo") ->
>>       sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>>         call_umh(struct file)
>>
>> Such approach enables kernel to delegate functionality traditionally done
>> by kernel modules into user space processes (either root or !root) and
>> reduces security attack surface of such new code, meaning in case of
>> potential bugs only the umh would crash but not the kernel. Another
>> advantage coming with that would be that bpfilter.ko can be debugged and
>> tested out of user space as well (e.g. opening the possibility to run
>> all clang sanitizers, fuzzers or test suites for checking translation).
> 
> Several questions spinning at the moment, I will probably come up with
> more:

Sure, no problem at all. It's an early RFC, so purpose is to get a
discussion going on such potential approach.

> 1. Does this still attach the binary blob to the 'normal' iptables
>    hooks?

Yeah, so thought would be to keep the user land tooling functional as
is w/o having to recompile binaries, thus this would also need to attach
for the existing hooks in order to keep semantics working. As a benefit
in addition we can also reuse all the rest of the infrastructure to utilize
things like XDP for iptables in the background, there is definitely
flexibility on this side thus users could eventually benefit from this
transparently and don't need to know that 'bpfilter' exists and is
translating in the background. I realize taking this path is a long-term
undertaking that we would need to tackle as a community, not just one or
two individuals, if we decide to go in this direction.

> 2. If yes, do you see issues wrt. 'iptables' and 'bpfilter' attached
> programs being different in nature (e.g. changed by different entities)?

There could certainly be multiple options, e.g. a fall-through with state
transfer once a request cannot be handled yet or a sysctl with iptables
being the default handler and an option to switch to bpfilter for letting
it handle requests for that time being.

> 3. What happens if the rule can't be translated (yet?)

(See above.)

> 4. Do you plan to reimplement connection tracking in userspace?

One option could be to have a generic, skb-less connection tracker in kernel
that can be reused from the various hooks it would need to handle, potentially
that would also be able to get offloaded into HW as another benefit coming
out from that.

> If no, how will the bpf program interact with it?
> [ same question applies to ipv6 exthdr traversal, ip defragmentation
> and the like ].

The v6 exthdr traversal could be realized natively via BPF which should
make the parsing more robust at the same time than having it somewhere
inside a helper in kernel directly; bounded loops in BPF would help as
well on that front, similarly for defrag this could be handled by the prog
although here we would need additional infra to queue the packets and then
recirculate.

> I will probably have a quadrillion of followup questions, sorry :-/

Definitely, please do!

Thanks,
Daniel

>> Also, such architecture makes the kernel/user boundary very precise,
>> meaning requests can be handled and BPF translated in control plane part
>> in user space with its own user memory etc, while minimal data plane
>> bits are in kernel. It would also allow to remove old xtables modules
>> at some point from the kernel while keeping functionality in place.
> 
> This is what we tried with nftables :-/

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-16 Thread Florian Westphal

Florian Westphal  wrote:
> Daniel Borkmann  wrote:
> Several questions spinning at the moment, I will probably come up with
> more:

... and here there are some more ...

One of the many pain points of xtables design is the assumption of 'used
only by sysadmin'.

This has not been true for a very long time, so by now iptables has
this userspace lock (yes, it's a fugly workaround) to serialize concurrent
iptables invocations in userspace.

AFAIU the translate-in-userspace design now brings back the old problem
of different tools overwriting each other's iptables rules.

Another question -- am I correct in that each rule manipulation would
incur a 'recompilation'?  Or are there different mini programs chained
together?

One of the nftables advantages is that (since rule representation in
kernel is black-box from userspace point of view) the kernel
can announce add/delete of rules or elements from nftables sets.

Any particular reason why translating iptables rather than nftables
(it should be possible to monitor the nftables changes that are
 announced by kernel and act on those)?

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-16 Thread Florian Westphal

Daniel Borkmann  wrote:
> This is a very rough and early proof of concept that implements bpfilter.

[..]

> Also, as a benefit from such design, we get BPF JIT compilation on x86_64,
> arm64, ppc64, sparc64, mips64, s390x and arm32, but also rule offloading
> into HW for free for Netronome NFP SmartNICs that are already capable of
> offloading BPF since we can reuse all existing BPF infrastructure as the
> back end. The user space iptables binary issuing rule addition or dumps was
> left as-is, thus at some point any binaries against iptables uapi kernel
> interface could transparently be supported in such manner in long term.
>
> As rule translation can potentially become very complex, this is performed
> entirely in user space. In order to ease deployment, request_module() code
> is extended to allow user mode helpers to be invoked. Idea is that user mode
> helpers are built as part of the kernel build and installed as traditional
> kernel modules with .ko file extension into distro specified location,
> such that from a distribution point of view, they are no different than
> regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) binaries via:
> 
>   request_module("foo") ->
>     call_umh("modprobe foo") ->
>       sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>         call_umh(struct file)
>
> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack surface of such new code, meaning in case of
> potential bugs only the umh would crash but not the kernel. Another
> advantage coming with that would be that bpfilter.ko can be debugged and
> tested out of user space as well (e.g. opening the possibility to run
> all clang sanitizers, fuzzers or test suites for checking translation).

Several questions spinning at the moment, I will probably come up with
more:
1. Does this still attach the binary blob to the 'normal' iptables
   hooks?
2. If yes, do you see issues wrt. 'iptables' and 'bpfilter' attached
programs being different in nature (e.g. changed by different entities)?
3. What happens if the rule can't be translated (yet?)
4. Do you plan to reimplement connection tracking in userspace?
If no, how will the bpf program interact with it?
[ same question applies to ipv6 exthdr traversal, ip defragmentation
and the like ].

I will probably have a quadrillion of followup questions, sorry :-/

> Also, such architecture makes the kernel/user boundary very precise,
> meaning requests can be handled and BPF translated in control plane part
> in user space with its own user memory etc, while minimal data plane
> bits are in kernel. It would also allow to remove old xtables modules
> at some point from the kernel while keeping functionality in place.

This is what we tried with nftables :-/

[PATCH RFC 0/4] net: add bpfilter

2018-02-16 Thread Daniel Borkmann

This is a very rough and early proof of concept that implements bpfilter.
The basic idea of bpfilter is that it can process iptables queries and
translate them in user space into BPF programs which can then get attached
at various locations. For simplicity, in this RFC we demo attaching them
to XDP layer, but any other location would work as well (e.g. at the tc
sch_clsact ingress/egress location or any other/new hook with equivalent
semantics).

Also, as a benefit from such design, we get BPF JIT compilation on x86_64,
arm64, ppc64, sparc64, mips64, s390x and arm32, but also rule offloading
into HW for free for Netronome NFP SmartNICs that are already capable of
offloading BPF since we can reuse all existing BPF infrastructure as the
back end. The user space iptables binary issuing rule addition or dumps was
left as-is, thus at some point any binaries against iptables uapi kernel
interface could transparently be supported in such manner in long term.

As rule translation can potentially become very complex, this is performed
entirely in user space. In order to ease deployment, request_module() code
is extended to allow user mode helpers to be invoked. Idea is that user mode
helpers are built as part of the kernel build and installed as traditional
kernel modules with .ko file extension into distro specified location,
such that from a distribution point of view, they are no different than
regular kernel modules. Thus, allow request_module() logic to load such
user mode helper (umh) binaries via:

  request_module("foo") ->
    call_umh("modprobe foo") ->
      sys_finit_module(FD of /lib/modules/.../foo.ko) ->
        call_umh(struct file)

Such approach enables kernel to delegate functionality traditionally done
by kernel modules into user space processes (either root or !root) and
reduces security attack surface of such new code, meaning in case of
potential bugs only the umh would crash but not the kernel. Another
advantage coming with that would be that bpfilter.ko can be debugged and
tested out of user space as well (e.g. opening the possibility to run
all clang sanitizers, fuzzers or test suites for checking translation).
Also, such architecture makes the kernel/user boundary very precise,
meaning requests can be handled and BPF translated in control plane part
in user space with its own user memory etc, while minimal data plane
bits are in kernel. It would also allow to remove old xtables modules
at some point from the kernel while keeping functionality in place.
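
The loading chain described above can be modelled with a small Python
sketch. This is a toy simulation and not kernel code: the Kernel class, the
module_dir mapping and the is_umh flag are hypothetical, and how the kernel
actually tells a umh binary from a regular module is not modelled here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModuleFile:
    """A file under /lib/modules/.../ with a .ko extension."""
    name: str
    is_umh: bool  # hypothetical flag: True if the .ko is a user mode helper


@dataclass
class Kernel:
    module_dir: Dict[str, ModuleFile]
    loaded_modules: List[str] = field(default_factory=list)
    running_helpers: List[str] = field(default_factory=list)
    trace: List[str] = field(default_factory=list)

    def request_module(self, name: str) -> bool:
        self.trace.append(f'request_module("{name}")')
        return self._call_umh_modprobe(name)

    def _call_umh_modprobe(self, name: str) -> bool:
        # The kernel spawns modprobe, which looks the file up and loads it.
        self.trace.append(f'call_umh("modprobe {name}")')
        mod = self.module_dir.get(name)
        if mod is None:
            return False
        return self._sys_finit_module(mod)

    def _sys_finit_module(self, mod: ModuleFile) -> bool:
        self.trace.append(f"sys_finit_module(FD of {mod.name}.ko)")
        if mod.is_umh:
            # Instead of linking code into the kernel, start a user process.
            self.trace.append("call_umh(struct file)")
            self.running_helpers.append(mod.name)
        else:
            self.loaded_modules.append(mod.name)
        return True
```

With module_dir containing a umh entry named "bpfilter", calling
request_module("bpfilter") records the four steps of the chain in trace and
leaves loaded_modules empty, which is the point of the design: a bug in the
helper would take down a process rather than the kernel.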

In the implemented proof of concept we show that simple /32 src/dst IPs
are translated in such manner. More complex rules would be added later
as well, also different BPF code generation backends that can be selected
for the various attachment points, proper encoder/decoder for the uapi
requests, etc. This just starts out very simple and basic for the sake
of an early RFC to demo the idea.
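
As an illustration of what such a translated /32 src/dst rule has to do per
packet, here is a minimal Python sketch. It is not the generated BPF and not
code from the PoC: make_filter() and ipv4_frame() are hypothetical helpers,
and the interface match (-i) is ignored. The checks follow the usual XDP
pattern: bounds-check the Ethernet header, pass anything that is not IPv4,
then compare the source and destination addresses.

```python
import socket
import struct

XDP_DROP, XDP_PASS = 1, 2  # values of the kernel's enum xdp_action
ETH_HLEN = 14              # Ethernet header length in bytes
ETH_P_IP = 0x0800          # EtherType for IPv4
IP_MIN_HLEN = 20           # IPv4 header without options


def make_filter(src: str, dst: str, verdict: int = XDP_DROP):
    """Build a per-packet function for one exact src/dst IPv4 rule."""
    want_src = socket.inet_aton(src)
    want_dst = socket.inet_aton(dst)

    def run(frame: bytes) -> int:
        # Bounds check before touching the Ethernet header.
        if len(frame) < ETH_HLEN:
            return XDP_PASS
        (ethertype,) = struct.unpack_from("!H", frame, 12)
        if ethertype != ETH_P_IP:
            return XDP_PASS
        # Bounds check before touching the IPv4 header.
        if len(frame) < ETH_HLEN + IP_MIN_HLEN:
            return XDP_PASS
        got_src = frame[ETH_HLEN + 12:ETH_HLEN + 16]
        got_dst = frame[ETH_HLEN + 16:ETH_HLEN + 20]
        if got_src == want_src and got_dst == want_dst:
            return verdict
        return XDP_PASS  # chain policy ACCEPT

    return run


def ipv4_frame(src: str, dst: str) -> bytes:
    """Build a minimal Ethernet + IPv4 frame for testing (no payload)."""
    eth = bytes(12) + struct.pack("!H", ETH_P_IP)
    ip = bytes([0x45]) + bytes(11) + socket.inet_aton(src) + socket.inet_aton(dst)
    return eth + ip
```

For example, make_filter("127.0.0.2", "127.0.0.1") returns XDP_DROP for
ipv4_frame("127.0.0.2", "127.0.0.1") and XDP_PASS for any other source
address, matching the DROP rule used in the examples below.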

In the below example, we show that dumping, loading and offloading of
one or multiple simple rules work, we show the bpftool XDP dump of the
generated BPF instruction sequence as well as a simple functional ping
test to enforce policy in such way.

Set rebased on top of 255442c93843 ("Merge tag 'docs-4.16' of [...]").

Feedback very welcome!

Various bpfilter usage examples from the PoC code:

1) Dumping current rules:

  # iptables -t filter -L
  Chain INPUT (policy ACCEPT)
  target     prot opt source               destination

  Chain FORWARD (policy ACCEPT)
  target     prot opt source               destination

  Chain OUTPUT (policy ACCEPT)
  target     prot opt source               destination

2) ping test:

  # ping -c 1 127.0.0.1 -I 127.0.0.2
  PING 127.0.0.1 (127.0.0.1) from 127.0.0.2 : 56(84) bytes of data.
  64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.040 ms

  --- 127.0.0.1 ping statistics ---
  1 packets transmitted, 1 received, 0% packet loss, time 0ms
  rtt min/avg/max/mdev = 0.040/0.040/0.040/0.000 ms

3) Adding & dumping a simple rule:

  # iptables -t filter -A INPUT -i lo -s 127.0.0.2/32 -d 127.0.0.1/32 -j DROP
  # iptables -t filter -L
  Chain INPUT (policy ACCEPT)
  target     prot opt source               destination
  DROP       all  --  127.0.0.2            localhost

  Chain FORWARD (policy ACCEPT)
  target     prot opt source               destination

  Chain OUTPUT (policy ACCEPT)
  target     prot opt source               destination

4) Dump BPF generated code for that rule (on lo it's XDP generic, otherwise
   native XDP for XDP supported drivers):

  # bpftool p
  18: xdp  tag 6b07f663830d5b0c
          loaded_at Feb 14/01:15  uid 0
          xlated 208B  not jited  memlock 4096B
  # bpftool p d x i 18
   0: (bf) r9 = r1
   1: (79) r2 = *(u64 *)(r9 +0)
   2: (79) r3 = *(u64 *)(r9 +8)
   3: (bf) r1 = r2
   4: (07) r1 += 14
   5: (bd) if r1 <= r3 goto pc+2
   6: (b4) (u32) r0 = (u32) 2
   7: (95) exit
   8: (bf) r1 = r2
   9: (b4) (u32) r5 = (u32) 0
  10: (69) r4 = *(u16 *)(r1 +12)
  11: (55) if r4 != 0x8 go
