Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer
On Thu, Mar 15, 2018 at 06:00:22PM +0100, Florian Westphal wrote:
> Alexei Starovoitov wrote:
> > The way this IMR is defined today looks pretty much like nft and
> > it feels a bit lower level than iptables conversion would need.
>
> It wasn't so much about a specific IMR but to avoid code duplication
> between nft and iptables translators.
>
> > I think it would be simpler to have user space only extensions
> > and opcodes added to bpf for the purpose of the translation.
> > Like there is no bpf instruction called 'load from IP header',
> > but we can make one. Just extend extended bpf with an instruction
> > like this and on the first pass do full conversion of nft
> > directly into this 'extended extended bpf'.
>
> I don't want to duplicate any ebpf conversion (and optimisations)
> in the nft part.
>
> If nft can be translated to this 'extended extended bpf' and
> this then generates bpf code from nft input all is good.

If possible it's great to avoid duplication, but it shouldn't be such an
ultimate goal that it cripples the iptables->bpf conversion just to
reuse the nft->bpf bits.
Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer
Alexei Starovoitov wrote:
> The way this IMR is defined today looks pretty much like nft and
> it feels a bit lower level than iptables conversion would need.

It wasn't so much about a specific IMR but to avoid code duplication
between nft and iptables translators.

> I think it would be simpler to have user space only extensions
> and opcodes added to bpf for the purpose of the translation.
> Like there is no bpf instruction called 'load from IP header',
> but we can make one. Just extend extended bpf with an instruction
> like this and on the first pass do full conversion of nft
> directly into this 'extended extended bpf'.

I don't want to duplicate any ebpf conversion (and optimisations)
in the nft part.

If nft can be translated to this 'extended extended bpf' and this then
generates bpf code from nft input, all is good.
Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer
On Tue, Mar 06, 2018 at 06:18:04PM +, Edward Cree wrote:
> On 06/03/18 18:03, Florian Westphal wrote:
> > I don't know. I suspect we should go for naive algorithm only,
> > but I would defer such decision to Alexei/Daniel.
> >
> > f.e. i don't know if using llvm is a good idea or not,
>
> Yeah, I wondered about that too. I think it's probably not a good idea,
> since it's a big project with a complex interface and the tail would
> likely wag the dog.
>
> > I did not
> > intend to turn proposed imr into full blown compiler in any case,
> > I only want to avoid code duplication for iptables/nftables -> ebpf
> > translator.
>
> Well, I think anything that calling code does by hand will entail limits
> on what the imr->ebpf step can do. E.g. if imr uses registers by name
> then the backend can't expand any highlevel constructs that need to use
> scratch registers, unless some regs are reserved for the backend and
> can't be used by the imr.
>
> Ultimately I think the imr will have to become practically a compiler
> back-end, because any jobs it punts to the caller will necessarily be
> duplicated between iptables and nftables.

A few years back we had a prototype of an nft->bpf converter. It was
reading nft netlink, generating really dumb C code, and feeding it into
clang/llvm. Performance of the generated code was surprisingly good.
llvm did particularly well at combining the same loads from multiple
rules into a single load and collapsing all the 'if' conditions into a
few.

So the full blown compiler approach is known to work, but... we cannot
use it here. We cannot have the bpfilter umh module depend on llvm.
The converter has to be fully contained within the kernel source tree.

It's probably fine to start simple with a compiler-like thingy and later
improve it with sophisticated optimizations and a fancy register
allocator, but I suspect we can achieve better performance and easier to
maintain code if we specialize the converters.
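To make the "really dumb C code" idea concrete, here is a hedged sketch of the kind of per-rule code such a naive converter might emit (all names are illustrative, not from the actual prototype): every rule reloads the fields it needs, and clang/llvm is left to merge the duplicate loads and fold the if-chains, which is exactly the behaviour described above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch only: per-rule code a naive nft->C converter
 * might generate. Each rule reloads the IPv4 daddr (offset 16 in the
 * network header); an optimizing compiler will CSE the loads and
 * combine the branches. */

static uint32_t load_be32(const uint8_t *pkt, int off)
{
	return (uint32_t)pkt[off] << 24 | (uint32_t)pkt[off + 1] << 16 |
	       (uint32_t)pkt[off + 2] << 8 | (uint32_t)pkt[off + 3];
}

enum verdict { VERDICT_PASS, VERDICT_DROP };

static enum verdict run_rules(const uint8_t *pkt)
{
	/* rule 1: ip daddr 127.0.0.2 drop */
	if (load_be32(pkt, 16) == 0x7f000002)
		return VERDICT_DROP;
	/* rule 2: ip daddr 127.0.0.3 drop -- reloads the same field;
	 * llvm would combine this load with the one above */
	if (load_be32(pkt, 16) == 0x7f000003)
		return VERDICT_DROP;
	return VERDICT_PASS;
}
```

The point is that the generator itself needs no optimizer at all when llvm sits behind it; the in-tree converter discussed here cannot rely on that.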
The way this IMR is defined today looks pretty much like nft and it
feels a bit lower level than iptables conversion would need.

Converting iptables to this IMR is like going back to the old days of
gcc, when the C/C++ front-ends were generating RTL and all optimizations
were done on the RTL representation. Eventually gcc had to adopt gimple
(a variant of SSA) to do optimizations that were very hard to do with
RTL. I suspect if we go iptables->nft (same as this imr) the history
will repeat itself. It will work, but it will be difficult to apply
interesting optimizations.

Similarly I don't quite see why this particular IMR is needed for
nft->bpf either. I think it would be simpler to have user space only
extensions and opcodes added to bpf for the purpose of the translation.
Like there is no bpf instruction called 'load from IP header', but we
can make one. Just extend extended bpf with an instruction like this
and on the first pass do a full conversion of nft directly into this
'extended extended bpf'. Then do a second pass of lowering 'extended
extended bpf' into the eBPF that the kernel actually understands. That
would be similar to what llvm does when it lowers a higher level
representation into a lower one. The higher level IR doesn't need to be
completely different from the lower level IR; often the lower level IR
is a subset of the higher level one.

For iptables->bpf I think it's easier to construct a decision tree out
of all iptables rules, optimize it as a tree and then convert it into a
set of map lookups glued together with a bit of bpf code. Converting
iptables rules one by one isn't going to scale. It was fine for a
prototype, but now we're talking production. If there are 1k iptables
rules and we generate just 4 bpf insns for every rule, we will already
hit the 4k insn limit. We can easily bump the limit, but even if it's
128k insns it's still a limit. All iptables rules should be converted
together.
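The two-pass scheme above can be sketched as follows. This is a toy model only: the opcode `EBPF_X_LD_IPH` and the `struct insn` encoding are made up for illustration (real eBPF encodes instructions as `struct bpf_insn` with code/registers/off/imm, and has no 'load from IP header' opcode, which is exactly the point of the userspace-only extension).

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of "extended extended bpf": a userspace-only opcode such
 * as 'load from IP header' is lowered in a second pass into plain
 * eBPF-style instructions. All opcodes here are stand-ins. */

enum {
	EBPF_LD_ABS,	/* stands in for a real eBPF load */
	EBPF_ALU_ADD,	/* stands in for a real eBPF add */
	EBPF_X_LD_IPH,	/* extension: load at offset imm in the IP header */
};

struct insn { int code; int imm; };

/* Lower one instruction; extended opcodes expand into several plain
 * ones, plain eBPF passes through. Returns the number emitted. */
static int lower(const struct insn *in, struct insn *out)
{
	if (in->code == EBPF_X_LD_IPH) {
		/* expand: advance past the ethernet header to reach the
		 * network header, then do the plain load */
		out[0] = (struct insn){ EBPF_ALU_ADD, 14 };
		out[1] = (struct insn){ EBPF_LD_ABS, in->imm };
		return 2;
	}
	out[0] = *in;
	return 1;
}
```

After the lowering pass the stream contains only opcodes the kernel understands, which is what the second pass in the message above describes.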
Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer
On 06/03/18 18:03, Florian Westphal wrote:
> I don't know. I suspect we should go for naive algorithm only,
> but I would defer such decision to Alexei/Daniel.
>
> f.e. i don't know if using llvm is a good idea or not,

Yeah, I wondered about that too. I think it's probably not a good idea,
since it's a big project with a complex interface and the tail would
likely wag the dog.

> I did not
> intend to turn proposed imr into full blown compiler in any case,
> I only want to avoid code duplication for iptables/nftables -> ebpf
> translator.

Well, I think anything that calling code does by hand will entail limits
on what the imr->ebpf step can do. E.g. if imr uses registers by name
then the backend can't expand any highlevel constructs that need to use
scratch registers, unless some regs are reserved for the backend and
can't be used by the imr.

Ultimately I think the imr will have to become practically a compiler
back-end, because any jobs it punts to the caller will necessarily be
duplicated between iptables and nftables.
-Ed
Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer
Edward Cree wrote:
> On 06/03/18 16:42, Florian Westphal wrote:
> > I would also add 'highlevel' objects that are themselves translated into
> > basic operations. Most obvious example
> > are 'fetch 4 bytes x bytes into transport header'.
> >
> > Frontend should not need to bother with ipv4 details, such as ip
> > options. Instead IMR should take care of this and generate the needed
> > instructions to fetch iph->ihl and figure out the correct transport
> > header offset.
>
> Presumably then for this the IMR regs will cease to have any connection to
> BPF regs and will simply be (SSA?) r0, r1, ... as far as needed (not
> limited to 10 regs like BPF)? Then register allocation all happens in
> the IMR->BPF conversion (even for things 64 bits or smaller).
>
> I wonder how sophisticated we should be about register allocation; whether
> we should go the whole hog with graph-colouring algorithms or linear
> scan, or just do something naïve like an LRU.
>
> Relatedly, should we spill values to the stack when we run out of
> registers, or should we just rely on being able to rematerialise them
> from parsing the packet again?

I don't know. I suspect we should go for a naive algorithm only, but I
would defer such a decision to Alexei/Daniel.

f.e. I don't know if using llvm is a good idea or not. I did not intend
to turn the proposed imr into a full blown compiler in any case; I only
want to avoid code duplication for the iptables/nftables -> ebpf
translators.
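The 'highlevel object' expansion quoted above, where the IMR fetches iph->ihl to find the transport header rather than burdening the frontend, boils down to the following computation, shown as plain C for illustration (the IMR would emit the equivalent instructions: fetch the byte containing ihl, mask, shift/multiply):

```c
#include <assert.h>
#include <stdint.h>

/* What the IMR's expansion of 'fetch N bytes at offset X in the
 * transport header' must compute for IPv4: the transport header
 * offset from the header-length field. Plain C stand-in for the
 * generated instructions. */

static unsigned int ipv4_transport_offset(const uint8_t *nh)
{
	/* first byte of the IPv4 header: version in the high nibble,
	 * ihl (header length in 32-bit words) in the low nibble */
	unsigned int ihl = nh[0] & 0x0f;

	return ihl * 4;	/* transport header starts after any options */
}
```

A frontend rule matching on a TCP port would then address its load relative to this computed offset instead of a fixed one.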
Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer
On 06/03/18 16:42, Florian Westphal wrote:
> I would also add 'highlevel' objects that are themselves translated into
> basic operations. Most obvious example
> are 'fetch 4 bytes x bytes into transport header'.
>
> Frontend should not need to bother with ipv4 details, such as ip
> options. Instead IMR should take care of this and generate the needed
> instructions to fetch iph->ihl and figure out the correct transport
> header offset.

Presumably then for this the IMR regs will cease to have any connection
to BPF regs and will simply be (SSA?) r0, r1, ... as far as needed (not
limited to 10 regs like BPF)? Then register allocation all happens in
the IMR->BPF conversion (even for things 64 bits or smaller).

I wonder how sophisticated we should be about register allocation;
whether we should go the whole hog with graph-colouring algorithms or
linear scan, or just do something naïve like an LRU.

Relatedly, should we spill values to the stack when we run out of
registers, or should we just rely on being able to rematerialise them
from parsing the packet again?
-Ed
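The "something naïve like an LRU" option could look roughly like this. Everything here is hypothetical (the slot count `NPHYS`, `struct lru_alloc`, the spill counter): an unbounded supply of IMR pseudo-registers is mapped onto a small fixed set of machine registers, evicting the least recently used mapping when the set is full; whether an eviction means a stack spill or a rematerialisation is exactly the open question in the message above.

```c
#include <assert.h>

/* Sketch of a naive LRU register allocator for the IMR->BPF step.
 * Not from the posted patches; illustrative only. */

#define NPHYS 4

struct lru_alloc {
	int vreg[NPHYS];	/* pseudo-reg held by each slot, -1 = free */
	int stamp[NPHYS];	/* last-use time of each slot */
	int clock;
	int spills;		/* evictions (spill or rematerialise) */
};

static void lru_init(struct lru_alloc *a)
{
	for (int i = 0; i < NPHYS; i++) {
		a->vreg[i] = -1;
		a->stamp[i] = 0;
	}
	a->clock = 0;
	a->spills = 0;
}

/* Return the physical register slot for pseudo-register v. */
static int lru_get(struct lru_alloc *a, int v)
{
	int victim = 0;

	a->clock++;

	/* already mapped? */
	for (int i = 0; i < NPHYS; i++) {
		if (a->vreg[i] == v) {
			a->stamp[i] = a->clock;
			return i;
		}
	}
	/* take a free slot, else evict the least recently used one */
	for (int i = 0; i < NPHYS; i++) {
		if (a->vreg[i] == -1) {
			victim = i;
			goto take;
		}
		if (a->stamp[i] < a->stamp[victim])
			victim = i;
	}
	a->spills++;
take:
	a->vreg[victim] = v;
	a->stamp[victim] = a->clock;
	return victim;
}
```

A graph-colouring or linear-scan allocator would replace `lru_get` wholesale; the rest of the IMR->BPF conversion would not need to change, which is one argument for starting naive.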
Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer
Daniel Borkmann wrote:
> On 03/04/2018 08:40 PM, Florian Westphal wrote:
> > Its still in early stage, but I think its good enough as
> > a proof-of-concept.
>
> be the critical part in that it needs to be flexible enough to cover
> both front ends well enough without having to make compromises to
> one of them.

Yes, that's the idea.

> The same would be for optimization passes e.g. when we
> know that two successive rules would match on TCP header bits that we
> can reuse the register that loaded/parsed it previously to that point.

Yes, I already thought about this. First step would be to simply track
the last accessed header base (mac, nh, th) in the imr state so we can
reuse the offset if possible.

> Similar when it comes to maps when the lookup value would need to
> propagate through the linked imr objects. Do you see such optimizations
> or in general propagation of state as direct part of the IMR or rather
> somewhat hidden in IMR layer when doing the IMR to BPF 'jit' phase?

Direct part. I would only say that it should be hidden in the sense
that frontends don't need to do anything here. I would do such
propagation in the IMR before generating code.

The IMR has enough info to figure out:
1. the number of (ebpf) registers that are needed simultaneously at most
2. how many payload requests relative to (mac/network/tcp header) there
   are (both in total and in relation to the rule count)

So it could e.g. decide to set aside 2 registers to contain the th/nh
offsets to have them ready at all times, while not covering the mac
header at all because it's not used anywhere. Likewise, we could even
avoid for instance extension header parsing, as the IMR would see that
there are no 'IPv6+transport header' rules.

> Which other parts do you think would be needed for the IMR aside
> from above basic operations? ALU ops, JMPs for the relational ops?

We'll need at least the following IMR features/objects (pseudo
instructions):
- relops: gt, ge, lt etc.
- all bitops (shift, xor, and, etc.)
- payload: allow stores instead of loads (this includes csum fixup)
- verdict: jumps to other chains, nfqueue verdict
- meta: accesses to in/output interface names, skb length, and the like
- register allocation for ipv6 addresses, interface names, etc.
  (anything larger than 64 bit)

I would also add 'highlevel' objects that are themselves translated into
basic operations. Most obvious example is 'fetch 4 bytes at offset x in
the transport header'.

Frontend should not need to bother with ipv4 details, such as ip
options. Instead the IMR should take care of this and generate the
needed instructions to fetch iph->ihl and figure out the correct
transport header offset. It could do so by adding the needed part of
the rule (fetch byte containing ihl, mask/shift result, store the
offset, etc.) internally.

I would also add:
- limit (to jit both xt_limit and nft limit)
- statistic (random sampling)

as both nft and iptables have those features and I think they're pretty
much the same, so they can be handled by common code. Same for other
extensions, but those have side effects and I don't know yet how to
handle that (log, counter and the like).

Same for set lookups. In the iptables case, they will only be
'address/network/interface is (not) in set', i.e. only match/no match,
and set manipulations (SET target). In the nftables case it can also be
something like 'retrieve value belonging to key'
(dreg = lookup(map, sreg, sreg_len)), and then the dreg value gets used
in other ways.

I think that making the IMR handle the nftables case rather than the
iptables one has the advantage that the IMR might be able to optimize
repetitive iptables rules (and nft rules that do not use a jump map even
if they could).

iptables .. -i eth0 -j ETH0_RULES
iptables .. -i eth1 -j ETH1_RULES
..
iptables .. -i ethX -j ETHX_RULES

could be transparently thrown into a map and the jump target retrieved
from it based on iifname...

The IMR would also need to be able to create/allocate such sets/maps on
its own. Do you see this as something that has to be decided now? My
feeling is that we can probably handle this later by extending the IMR
rather than having to throw it away plus a rewrite :)

> I think it would be good to have clear semantics in terms of what
> it would eventually abstract away from raw BPF when sitting on top
> of it; potentially these could be details on packet access or [..]

Yes, it would be good if the IMR could e.g. do a merge of adjacent
loads:

load 4 bytes from network header at offset 12 (relop, equal) imm 0x1
load 4 bytes from network header at offset 16 (relop, equal) imm 0x2

.. would be condensed to ...

load 8 bytes from network header at offset 12 (relop, equal) imm 0x100000002

... before generating ebpf code. I don't think this should be part of
the frontend (too lowlevel a task for the frontend) and I don't see why
it should happen when generating code (too highlevel a task for that).

> interaction with helpers or other BPF features such that it
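An adjacent-load merge of the kind Florian sketches could be a small peephole pass over IMR equality tests; the struct layout below is hypothetical, not the actual IMR, and assumes the two 4-byte compares sit at consecutive offsets in the same header, with the first load supplying the high 32 bits of the merged big-endian immediate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical IMR equality-compare descriptor, for illustrating the
 * adjacent-load merge discussed in the thread. */
struct imr_cmp {
	int base;	/* e.g. network header */
	int off;	/* byte offset within that header */
	int len;	/* load size in bytes */
	uint64_t imm;	/* value compared for equality */
};

/* Try to fold b into a; returns 1 on success. */
static int imr_cmp_merge(struct imr_cmp *a, const struct imr_cmp *b)
{
	if (a->base != b->base || a->len != 4 || b->len != 4 ||
	    b->off != a->off + 4)
		return 0;

	/* network byte order: the first load becomes the high 32 bits */
	a->imm = a->imm << 32 | (b->imm & 0xffffffffu);
	a->len = 8;
	return 1;
}
```

Running this before code generation keeps the pass out of both the frontend and the eBPF emitter, matching where Florian argues it belongs.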
Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer
On 03/04/2018 08:40 PM, Florian Westphal wrote:
> These patches, which go on top of the 'bpfilter' RFC patches,
> demonstrate an nftables to ebpf translation (done in userspace).
> In order to not duplicate the ebpf code generation efforts, the rules
>
>   iptables -i lo -d 127.0.0.2 -j DROP
> and
>   nft add rule ip filter input ip daddr 127.0.0.2 drop
>
> are first translated to a common intermediate representation, and then
> to ebpf, which attaches the resulting prog to the XDP hook.
>
> The IMR representation is identical in both cases, so both rules
> result in the same ebpf program.
>
> The IMR currently assumes that translation will always be to ebpf.
> As per previous discussion it doesn't consider other targets, so
> for instance IMR pseudo-registers map 1:1 to ebpf ones.
>
> The IMR is also supposed to be generic enough to make it easy to
> convert 'frontend' formats (iptables rule blob, nftables netlink) to
> it, and also to extend it to cover ip rule, ovs or any other inputs
> in the future without need for major changes to the IMR.
>
> The IMR currently implements the following basic operations:
> - relational (equal, not equal)
> - immediates (32 and 64 bit constants)
> - payload with relative addressing (mac, network, transport header)
> - verdict (pass, drop, next rule)
>
> It's still in an early stage, but I think it's good enough as
> a proof-of-concept.

Thanks a lot for working on this! Overall I like the PoC and the
underlying idea of it!

I think the design of such an IMR would indeed be the critical part, in
that it needs to be flexible enough to cover both front ends well
without having to make compromises to one of them. The same goes for
optimization passes, e.g. when we know that two successive rules would
match on TCP header bits, we can reuse the register that loaded/parsed
them previously. Similar when it comes to maps, when the lookup value
would need to propagate through the linked imr objects. Do you see such
optimizations, or in general propagation of state, as a direct part of
the IMR, or rather somewhat hidden in the IMR layer when doing the IMR
to BPF 'jit' phase?

Which other parts do you think would be needed for the IMR aside from
the above basic operations? ALU ops, JMPs for the relational ops?

I think it would be good to have clear semantics in terms of what it
would eventually abstract away from raw BPF when sitting on top of it;
potentially these could be details on packet access or interaction with
helpers or other BPF features, such that it's BPF prog type independent
at this stage. Another point could be that, given the priv/cap level of
the uapi request, there could also be different BPF code gen backends
that implement against the IMR: e.g. when the request comes out of a
userns, it has feature constraints, such as having to use the
bpf_skb_{load,store}_bytes() helpers for packet access instead of
direct packet access, or not being able to use BPF to BPF calls, etc;
wdyt?

> Known differences between nftjit.ko and bpfilter.ko:
> nftjit.ko currently doesn't run transparently, but that's only
> because I wanted to focus on the IMR and get the POC out of the door.
>
> It should be possible to get it transparent via the bpfilter.ko
> approach.
>
> Next steps for the IMR could be the addition of binary operations for
> prefixes ("-d 192.168.0.1/24"); that's also needed e.g. for tcp flag
> matching (-p tcp --syn in iptables) and so on.
>
> I'd also be interested in whether XDP is seen as an appropriate
> target hook. AFAICS the XDP and nftables ingress hooks are similar
> enough to consider just (re)using the XDP hook to jit the nftables
> ingress hook. The translator could check if the hook is unused, and
> return early if some other program is already attached.
>
> Comments welcome, especially wrt. the IMR concept and what might be
> the next step(s) in moving forward.
>
> The patches are also available via git at
> https://git.breakpoint.cc/cgit/fw/net-next.git/log/?h=bpfilter7 .

Thanks,
Daniel
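Daniel's per-privilege backend idea can be modelled in userspace like this. Everything below is a stand-in (fake skb, stand-in load functions): the real distinction is between direct packet access, where the verifier demands a bounds check against data_end before any pointer dereference, and going through a helper in the style of bpf_skb_load_bytes(skb, off, to, len), which performs its own bounds checking and is available where direct access is not.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace model of picking a packet-access backend per privilege
 * level. These are NOT kernel APIs; fake_skb and both load functions
 * are illustrative stand-ins. */

struct fake_skb {
	const uint8_t *data;
	const uint8_t *data_end;
};

/* models direct packet access: explicit bounds check, then read */
static int load_direct(const struct fake_skb *skb, int off, uint32_t *val)
{
	if (skb->data + off + 4 > skb->data_end)
		return -1;	/* verifier-style bounds failure */
	memcpy(val, skb->data + off, 4);
	return 0;
}

/* models helper-style access a la bpf_skb_load_bytes(): the helper
 * itself validates the range and copies out */
static int load_helper(const struct fake_skb *skb, int off, uint32_t *val)
{
	if (off < 0 || skb->data + off + 4 > skb->data_end)
		return -1;
	memcpy(val, skb->data + off, 4);
	return 0;
}

/* backend selection: unprivileged requests take the helper path */
static int load32(const struct fake_skb *skb, int off, uint32_t *val,
		  int unprivileged)
{
	return unprivileged ? load_helper(skb, off, val)
			    : load_direct(skb, off, val);
}
```

The IMR would stay identical in both cases; only the code generator behind it would swap the access style, which is what makes the backend choice invisible to the frontends.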
[RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer
These patches, which go on top of the 'bpfilter' RFC patches,
demonstrate an nftables to ebpf translation (done in userspace).

In order to not duplicate the ebpf code generation efforts, the rules

  iptables -i lo -d 127.0.0.2 -j DROP
and
  nft add rule ip filter input ip daddr 127.0.0.2 drop

are first translated to a common intermediate representation (IMR), and
then to ebpf, which attaches the resulting prog to the XDP hook.

The IMR representation is identical in both cases, so both rules result
in the same ebpf program.

The IMR currently assumes that translation will always be to ebpf. As
per previous discussion it doesn't consider other targets, so for
instance IMR pseudo-registers map 1:1 to ebpf ones.

The IMR is also supposed to be generic enough to make it easy to
convert 'frontend' formats (iptables rule blob, nftables netlink) to
it, and also to extend it to cover ip rule, ovs or any other inputs in
the future without need for major changes to the IMR.

The IMR currently implements the following basic operations:
- relational (equal, not equal)
- immediates (32 and 64 bit constants)
- payload with relative addressing (mac, network, transport header)
- verdict (pass, drop, next rule)

It's still in an early stage, but I think it's good enough as a
proof-of-concept.

Known differences between nftjit.ko and bpfilter.ko: nftjit.ko
currently doesn't run transparently, but that's only because I wanted
to focus on the IMR and get the POC out of the door. It should be
possible to get it transparent via the bpfilter.ko approach.

Next steps for the IMR could be the addition of binary operations for
prefixes ("-d 192.168.0.1/24"); that's also needed e.g. for tcp flag
matching (-p tcp --syn in iptables) and so on.

I'd also be interested in whether XDP is seen as an appropriate target
hook. AFAICS the XDP and nftables ingress hooks are similar enough to
consider just (re)using the XDP hook to jit the nftables ingress hook.
The translator could check if the hook is unused, and return early if
some other program is already attached.

Comments welcome, especially wrt. the IMR concept and what might be the
next step(s) in moving forward.

The patches are also available via git at
https://git.breakpoint.cc/cgit/fw/net-next.git/log/?h=bpfilter7 .
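One way the basic operations listed in the cover letter (relationals, immediates, relative payload loads, verdicts) could look as C objects is sketched below. This mirrors the description only; the actual structures in the patches may differ, and the tiny evaluator exists just to show how the pieces compose into a rule like 'ip daddr 127.0.0.2 drop'.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical IMR object model matching the cover letter's list of
 * basic operations; illustrative, not the posted implementation. */

enum imr_obj_type { IMR_IMMEDIATE, IMR_PAYLOAD, IMR_RELOP, IMR_VERDICT };
enum imr_payload_base { IMR_BASE_MAC, IMR_BASE_NETWORK, IMR_BASE_TRANSPORT };
enum imr_relop { IMR_EQ, IMR_NE };
enum imr_verdict { IMR_VERDICT_PASS, IMR_VERDICT_DROP, IMR_VERDICT_NEXT };

struct imr_object {
	enum imr_obj_type type;
	union {
		struct { uint64_t value; int len; } imm;	/* 4 or 8 bytes */
		struct { enum imr_payload_base base; int off, len; } payload;
		struct { enum imr_relop op;
			 struct imr_object *left, *right; } relop;
		enum imr_verdict verdict;
	};
};

/* Evaluate one relop of the form 'payload(base, off, len) OP imm'
 * against a network header, to show how the objects link together. */
static int imr_eval_relop(const struct imr_object *o, const uint8_t *nh)
{
	const struct imr_object *l = o->relop.left, *r = o->relop.right;
	uint64_t lv = 0;

	for (int i = 0; i < l->payload.len; i++)
		lv = lv << 8 | nh[l->payload.off + i];

	return o->relop.op == IMR_EQ ? lv == r->imm.value : lv != r->imm.value;
}
```

An ebpf code generator would walk the same linked objects and emit instructions instead of evaluating them, which is where the 1:1 pseudo-register mapping mentioned above comes in.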