Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer

2018-03-15 Thread Alexei Starovoitov
On Thu, Mar 15, 2018 at 06:00:22PM +0100, Florian Westphal wrote:
> Alexei Starovoitov  wrote:
> > The way this IMR is defined today looks pretty much like nft and
> > it feels a bit lower level than the iptables conversion would need.
> 
> It wasn't so much about a specific IMR but to avoid code duplication
> between nft and iptables translators.
> 
> > I think it would be simpler to have user space only extensions
> > and opcodes added to bpf for the purpose of the translation.
> > Like there is no bpf instruction called 'load from IP header',
> > but we can make one. Just extend extended bpf with an instruction
> > like this and on the first pass do full conversion of nft
> > directly into this 'extended extended bpf'.
> 
> I don't want to duplicate any ebpf conversion (and optimisations)
> in the nft part.
> 
> If nft can be translated to this 'extended extended bpf', and
> this then generates bpf code from the nft input, all is good.

If possible it's great to avoid duplication, but it shouldn't be
such an ultimate goal that it cripples the iptables->bpf conversion
just to reuse the nft->bpf bits.



Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer

2018-03-15 Thread Florian Westphal
Alexei Starovoitov  wrote:
> The way this IMR is defined today looks pretty much like nft and
> it feels a bit lower level than the iptables conversion would need.

It wasn't so much about a specific IMR but to avoid code duplication
between nft and iptables translators.

> I think it would be simpler to have user space only extensions
> and opcodes added to bpf for the purpose of the translation.
> Like there is no bpf instruction called 'load from IP header',
> but we can make one. Just extend extended bpf with an instruction
> like this and on the first pass do full conversion of nft
> directly into this 'extended extended bpf'.

I don't want to duplicate any ebpf conversion (and optimisations)
in the nft part.

If nft can be translated to this 'extended extended bpf', and
this then generates bpf code from the nft input, all is good.



Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer

2018-03-15 Thread Alexei Starovoitov
On Tue, Mar 06, 2018 at 06:18:04PM +, Edward Cree wrote:
> On 06/03/18 18:03, Florian Westphal wrote:
> > I don't know.  I suspect we should go for naive algorithm only,
> > but I would defer such decision to Alexei/Daniel.
> >
> > f.e. i don't know if using llvm is a good idea or not,
> Yeah, I wondered about that too.  I think it's probably not a good idea,
>  since it's a big project with a complex interface and the tail would
>  likely wag the dog.
> 
> > I did not
> > intend to turn proposed imr into full blown compiler in any case,
> > I only want to avoid code duplication for iptables/nftables -> ebpf
> > translator.
> Well, I think anything that calling code does by hand will entail limits
>  on what the imr->ebpf step can do.  E.g. if imr uses registers by name
>  then the backend can't expand any highlevel constructs that need to use
>  scratch registers, unless some regs are reserved for the backend and
>  can't be used by the imr.
> 
> Ultimately I think the imr will have to become practically a compiler
>  back-end, because any jobs it punts to the caller will necessarily be
>  duplicated between iptables and nftables.

A few years back we had a prototype of an nft->bpf converter.
It was reading nft netlink, generating really dumb C code,
and feeding it into clang/llvm. Performance of the generated code
was surprisingly good. llvm did particularly well combining
the same loads from multiple rules into a single load and
combining all the 'if' conditions into a few.
So the full-blown compiler approach is known to work, but ...
we cannot use it here.
We cannot have the bpfilter umh module depend on llvm.
The converter has to be fully contained within the kernel source tree.
It's probably fine to start simple with a compiler-like thingy and
later improve it with sophisticated optimizations and a fancy
register allocator, but I suspect we can achieve better performance
and easier-to-maintain code if we specialize the converters.

The way this IMR is defined today looks pretty much like nft and
it feels a bit lower level than the iptables conversion would need.
Converting iptables to this IMR is like going back to the old days
of gcc, when the C/C++ front ends generated RTL and all optimizations
were done on the RTL representation. Eventually gcc had to adopt
gimple (a variant of SSA) to do optimizations that were
very hard to do with RTL.
I suspect that if we go iptables->nft (same as this IMR), history
will repeat itself. It will work, but it will be difficult to apply
interesting optimizations.

Similarly I don't quite see why this particular IMR is needed
for nft->bpf either.
I think it would be simpler to have user space only extensions
and opcodes added to bpf for the purpose of the translation.
Like there is no bpf instruction called 'load from IP header',
but we can make one. Just extend extended bpf with an instruction
like this and on the first pass do full conversion of nft
directly into this 'extended extended bpf'.
Then do second pass of lowering 'extended extended bpf' into eBPF
that kernel actually understands.
That would be similar to what llvm does when it lowers a higher level
representation into a lower one.
The higher level IR doesn't need to be completely different
from lower level IR. Often lower level IR is a subset of higher level.

For iptables->bpf I think it's easier to construct a decision tree
out of all the iptables rules, optimize it as a tree and then
convert it into a set of map lookups glued together with a bit of bpf code.
Converting iptables rules one by one isn't going to scale.
It was fine for a prototype, but now we're talking production.
If there are 1k iptables rules and we generate just 4 bpf insns
for every rule, we will already hit 4k insn limit.
We can easily bump the limit, but even if it's 128k insns
it's still the limit. All iptables rules should be converted together.



Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer

2018-03-06 Thread Edward Cree
On 06/03/18 18:03, Florian Westphal wrote:
> I don't know.  I suspect we should go for naive algorithm only,
> but I would defer such decision to Alexei/Daniel.
>
> f.e. i don't know if using llvm is a good idea or not,
Yeah, I wondered about that too.  I think it's probably not a good idea,
 since it's a big project with a complex interface and the tail would
 likely wag the dog.

> I did not
> intend to turn proposed imr into full blown compiler in any case,
> I only want to avoid code duplication for iptables/nftables -> ebpf
> translator.
Well, I think anything that calling code does by hand will entail limits
 on what the imr->ebpf step can do.  E.g. if imr uses registers by name
 then the backend can't expand any highlevel constructs that need to use
 scratch registers, unless some regs are reserved for the backend and
 can't be used by the imr.

Ultimately I think the imr will have to become practically a compiler
 back-end, because any jobs it punts to the caller will necessarily be
 duplicated between iptables and nftables.

-Ed


Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer

2018-03-06 Thread Florian Westphal
Edward Cree  wrote:
> On 06/03/18 16:42, Florian Westphal wrote:
> > I would also add 'highlevel' objects that are themselves translated into
> > basic operations.  The most obvious example
> > is 'fetch 4 bytes at offset x into the transport header'.
> >
> > Frontend should not need to bother with ipv4 details, such as ip
> > options.  Instead IMR should take care of this and generate the needed
> > instructions to fetch iph->ihl and figure out the correct transport
> > header offset.
> Presumably then for this the IMR regs will cease to have any connection to
>  BPF regs and will simply be (SSA?) r0, r1, ... as far as needed (not
>  limited to 10 regs like BPF)?  Then register allocation all happens in
>  the IMR->BPF conversion (even for things 64 bits or smaller).
> 
> I wonder how sophisticated we should be about register allocation; whether
>  we should go the whole hog with graph-colouring algorithms or linear
>  scan, or just do something naïve like an LRU.
> 
> Relatedly, should we spill values to the stack when we run out of
>  registers, or should we just rely on being able to rematerialise them
>  from parsing the packet again?

I don't know.  I suspect we should go for naive algorithm only,
but I would defer such decision to Alexei/Daniel.

f.e. i don't know if using llvm is a good idea or not, I did not
intend to turn proposed imr into full blown compiler in any case,
I only want to avoid code duplication for iptables/nftables -> ebpf
translator.


Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer

2018-03-06 Thread Edward Cree
On 06/03/18 16:42, Florian Westphal wrote:
> I would also add 'highlevel' objects that are themselves translated into
> basic operations.  The most obvious example
> is 'fetch 4 bytes at offset x into the transport header'.
>
> Frontend should not need to bother with ipv4 details, such as ip
> options.  Instead IMR should take care of this and generate the needed
> instructions to fetch iph->ihl and figure out the correct transport
> header offset.
Presumably then for this the IMR regs will cease to have any connection to
 BPF regs and will simply be (SSA?) r0, r1, ... as far as needed (not
 limited to 10 regs like BPF)?  Then register allocation all happens in
 the IMR->BPF conversion (even for things 64 bits or smaller).

I wonder how sophisticated we should be about register allocation; whether
 we should go the whole hog with graph-colouring algorithms or linear
 scan, or just do something naïve like an LRU.

Relatedly, should we spill values to the stack when we run out of
 registers, or should we just rely on being able to rematerialise them
 from parsing the packet again?

-Ed


Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer

2018-03-06 Thread Florian Westphal
Daniel Borkmann  wrote:
> On 03/04/2018 08:40 PM, Florian Westphal wrote:
> > It's still in an early stage, but I think it's good enough as
> > a proof-of-concept.
> 
> be the critical part in that it needs to be flexible enough to cover
> both front ends well enough without having to make compromises to
> one of them.

Yes, that's the idea.

> The same would be for optimization passes e.g. when we
> know that two successive rules would match on TCP header bits that we
> can reuse the register that loaded/parsed it previously to that point.

Yes, I already thought about this.

First step would be to simply track the last accessed header base
(mac, nh, th) in the imr state so we can reuse the offset if possible.

> Similar when it comes to maps when the lookup value would need to
> propagate through the linked imr objects. Do you see such optimizations
> or in general propagation of state as direct part of the IMR or rather
> somewhat hidden in IMR layer when doing the IMR to BPF 'jit' phase?

Direct part.  I would only say that it should be hidden in the sense that
frontends don't need to do anything here.
I would do such propagation in the IMR before generating code.

The IMR has enough info to figure out:
1. the maximum number of (ebpf) registers needed simultaneously
2. how many payload requests relative to each header base (mac/network/transport)
there are (both in total and in relation to the rule count).

So it could e.g. decide to set aside 2 registers to contain the
th/nh offsets to have them ready at all times, while not covering the mac
header at all because it's not used anywhere.

Likewise, we could even avoid, for instance, extension header parsing,
as the IMR would see that there are no 'IPv6+transport header' rules.

> Which other parts do you think would be needed for the IMR aside
> from above basic operations? ALU ops, JMPs for the relational ops?

We'll need at least following IMR features/objects (pseudo instructions):

- relops: gt, ge, lt etc.
- all bitops (shift, xor, and, etc)
- payload: allow stores instead of loads (this includes csum fixup)
- verdict: jumps to other chains, nfqueue verdict
- meta: (accesses to in/output interface names, skb length, and the like).
- register allocation for ipv6 addresses, interface names, etc (anything
larger than 64 bit).

I would also add 'highlevel' objects that are themselves translated into
basic operations.  The most obvious example
is 'fetch 4 bytes at offset x into the transport header'.

Frontend should not need to bother with ipv4 details, such as ip
options.  Instead IMR should take care of this and generate the needed
instructions to fetch iph->ihl and figure out the correct transport
header offset.

It could do so by adding the needed part of the rule (fetch byte containing
ihl, mask/shift result, store the offset, etc.) internally.

I would also add:
- limit (to jit both xt_limit and nft limit)
- statistic (random sampling)

as both nft and iptables have those features and I think they're
pretty much the same, so they can be handled by common code.
Same for other extensions, but those have side effects and I don't
know yet how to handle them (log, counter and the like).

Same for set lookups.

In the iptables case, they will only be 'address/network/interface is (not) in set',
i.e. only match/no-match, plus set manipulations (the SET target).

In the nftables case it can also be something like 'retrieve the value
belonging to a key' (dreg = lookup(map, sreg, sreg_len)), and then the dreg
value gets used in other ways.

I think that making the IMR handle the nftables case rather than just the
iptables one has the advantage that the IMR might be able to optimize
repetitive iptables rules (and nft rules that do not use a jump map even
if they could).

iptables .. -i eth0 -j ETH0_RULES
iptables .. -i eth1 -j ETH1_RULES
..
iptables .. -i ethX -j ETHX_RULES

could be transparently thrown into a map and the jump target retrieved
from it based on iifname...

The IMR would also need to be able to create/allocate such sets/maps
on its own.

Do you see this as something that has to be decided now?
Ideally the IMR would be designed so that we can handle this
later by extending it rather than having to throw it away and
do a rewrite :)

> I think it would be good to have clear semantics in terms of what
> it would eventually abstract away from raw BPF when sitting on top
> of it; potentially these could be details on packet access or

[..]

Yes, it would be good if the IMR could e.g. merge adjacent loads:

load 4 bytes from network header at offset 12  (relop, equal) imm 0x1
load 4 bytes from network header at offset 16  (relop, equal) imm 0x2

 ..  would be condensed to ...

load 8 bytes from network header at offset 12  (relop, equal) imm 0x100000002

... before generating ebpf code.

I don't think this should be part of the frontend (too low-level a task for
the frontend), and I don't see why it should happen when generating code (too
high-level a task for that stage).

> interaction with helpers or other 

Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer

2018-03-06 Thread Daniel Borkmann
On 03/04/2018 08:40 PM, Florian Westphal wrote:
> These patches, which go on top of the 'bpfilter' RFC patches,
> demonstrate an nftables to ebpf translation (done in userspace).
> In order to not duplicate the ebpf code generation efforts, the rules
> 
> iptables -i lo -d 127.0.0.2 -j DROP
> and
> nft add rule ip filter input ip daddr 127.0.0.2 drop
> 
> are first translated to a common intermediate representation, and then
> to ebpf, which attaches resulting prog to the XDP hook.
> 
> The IMR representation is identical in both cases, so both
> rules result in the same ebpf program.
> 
> The IMR currently assumes that translation will always be to ebpf.
> As per previous discussion it doesn't consider other targets, so
> for instance IMR pseudo-registers map 1:1 to ebpf ones.
> 
> The IMR is also supposed to be generic enough to make it easy to convert
> 'frontend' formats (iptables rule blob, nftables netlink) to it, and
> also to extend it to cover ip rule, ovs or any other inputs in the future
> without need for major changes to the IMR.
> 
> The IMR currently implements following basic operations:
>  - Relational (equal, not equal)
>  - immediates (32 and 64bit constants)
>  - payload with relative addressing (mac, network, transport header)
>  - verdict (pass, drop, next rule)
> 
> It's still in an early stage, but I think it's good enough as
> a proof-of-concept.

Thanks a lot for working on this! Overall I like the PoC and the
underlying idea of it! I think the design of such IMR would indeed
be the critical part in that it needs to be flexible enough to cover
both front ends well enough without having to make compromises to
one of them. The same would be for optimization passes e.g. when we
know that two successive rules would match on TCP header bits that we
can reuse the register that loaded/parsed it previously to that point.
Similar when it comes to maps when the lookup value would need to
propagate through the linked imr objects. Do you see such optimizations
or in general propagation of state as direct part of the IMR or rather
somewhat hidden in IMR layer when doing the IMR to BPF 'jit' phase?

Which other parts do you think would be needed for the IMR aside
from above basic operations? ALU ops, JMPs for the relational ops?
I think it would be good to have clear semantics in terms of what
it would eventually abstract away from raw BPF when sitting on top
of it; potentially these could be details on packet access or
interaction with helpers or other BPF features, such that it's BPF prog
type independent at this stage. Another point could be that, given
the priv/cap level of the uapi request, there could also be different
BPF code gen backends that implement against the IMR. E.g. when the
request comes out of a userns, it has feature constraints, such as
having to use the bpf_skb_{load,store}_bytes() helpers for packet
access instead of direct packet access, or not being able to use
BPF-to-BPF calls, etc; wdyt?

> Known differences between nftjit.ko and bpfilter.ko:
> nftjit.ko currently doesn't run transparently, but that's only
> because I wanted to focus on the IMR and get the POC out of the door.
> 
> It should be possible to get it transparent via the bpfilter.ko approach.
> 
> Next steps for the IMR could be addition of binary operations for
> prefixes ("-d 192.168.0.1/24"), its also needed e.g. for tcp flag
> matching (-p tcp --syn in iptables) and so on.
> 
> I'd also be interested in whether XDP is seen as an appropriate
> target hook.  AFAICS the XDP and nftables ingress hooks are similar
> enough to consider just (re)using the XDP hook to jit the nftables ingress
> hook.  The translator could check if the hook is unused, and return
> early if some other program is already attached.
> 
> Comments welcome, especially wrt. IMR concept and what might be
> next step(s) in moving forward.
> 
> The patches are also available via git at
> https://git.breakpoint.cc/cgit/fw/net-next.git/log/?h=bpfilter7 .

Thanks,
Daniel