Re: [flamebait] xdp, well meaning but pointless
On Sat, 3 Dec 2016 11:48:22 -0800 John Fastabend wrote:
> On 16-12-03 08:19 AM, Willem de Bruijn wrote:
> > On Fri, Dec 2, 2016 at 12:22 PM, Jesper Dangaard Brouer wrote:
> >>
> >> On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal wrote:
> >>
> >>> In light of DPDK's existence it makes a lot more sense to me to provide a) a faster mmap based interface (possibly AF_PACKET based) that allows mapping the NIC directly into userspace, detaching tx/rx queues from the kernel.
> >>>
> >>> John Fastabend sent something like this last year as a proof of concept; iirc it was rejected because register space got exposed directly to userspace. I think we should re-consider merging netmap (or something conceptually close to its design).
> >>
> >> I'm actually working in this direction, of zero-copy RX mapping packets into userspace. This work is mostly related to page_pool, and I only plan to use XDP as a filter for selecting packets going to userspace, as this choice needs to be taken very early.
> >>
> >> My design is here:
> >>
> >> https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
> >>
> >> This is mostly about changing the memory model in the drivers, to allow for safely mapping pages to userspace. (An efficient queue mechanism is not covered.)
> >
> > Virtio virtqueues are used in various other locations in the stack. With separate memory pools and send + completion descriptor rings, signal moderation, careful avoidance of cacheline bouncing, etc. these seem like a good opportunity for a TPACKET_V4 format.
>
> FWIW, after we rejected exposing the register space to user space due to valid security issues, we fell back to using VFIO, which works nicely for mapping virtual functions into userspace and VMs. The main drawback is that user space has to manage the VF, but that is mostly a solved problem at this point. Deployment concerns aside.
Using VFs (PCIe SR-IOV Virtual Functions) solves this in a completely different, orthogonal way. To me it is still like taking over the entire NIC, although you use HW to split the traffic into VFs. Setup for VF deployment still involves troubling steps like 1G hugepages and vfio enable_unsafe_noiommu_mode=1. And generally getting SR-IOV working on your HW is a task of its own.

One thing people often seem to miss with SR-IOV VFs is that VM-to-VM traffic will be limited by PCIe bandwidth and transaction overheads, as Stephen Hemminger demonstrated[1] at NetDev 1.2, and Luigi also has a paper demonstrating this (AFAICR).

[1] http://netdevconf.org/1.2/session.html?stephen-hemminger

A key difference in my design is to allow the NIC to be shared in a safe manner. The NIC functions 100% as a normal Linux-controlled NIC. The catch is that once an application requests zero-copy RX, the NIC might have to reconfigure its RX-ring usage, as the driver MUST change into what I call the "read-only packet page" mode, which actually is the default in many drivers today.

> There was a TPACKET_V4 version we had a prototype of that passed buffers down to the hardware to use with the dma engine. This gives zero-copy but, same as VFs, requires the hardware to do all the steering of traffic and any expected policy in front of the application. Due to requiring user space to kick hardware and vice versa, though, it was somewhat slower, so I didn't finish it up. The kick was implemented as a syscall iirc. I can maybe look at it a bit more next week and see if it's worth reviving now in this context.

This is still at the design stage. The target here is that the page_pool and driver adjustments will provide the basis for building RX zero-copy solutions in a memory-safe manner. I do see tcpdump/RAW packet access like TPACKET_V4 being one of the first users of this.
Not the only user, as further down the road, I also imagine RX zero-copy delivery into sockets (and perhaps combined with a "raw_demux" step that doesn't alloc the SKB, which Tom hinted in the other thread for UDP delivery). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
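The early-filter idea in the message above (XDP deciding, before any SKB is allocated, whether a frame is steered to a zero-copy userspace queue or handed to the normal stack) can be sketched as follows. This is illustrative Python pseudocode, not the actual design: the function name steer() and the notion of an application registering UDP ports for zero-copy delivery are invented for the example; the real decision would run as an eBPF program in the driver RX path.

```python
ETH_HLEN = 14        # Ethernet header length
IPPROTO_UDP = 17

def steer(pkt: bytes, zc_ports: set) -> str:
    """Early RX verdict on the raw frame, before any SKB exists.

    Returns "userspace" (deliver via the zero-copy queue) or "stack"
    (normal Linux path). zc_ports is a hypothetical set of UDP dst
    ports an application registered for zero-copy delivery.
    """
    if len(pkt) < ETH_HLEN + 20:
        return "stack"                          # too short for IPv4
    ethertype = int.from_bytes(pkt[12:14], "big")
    if ethertype != 0x0800:
        return "stack"                          # not IPv4
    ihl = (pkt[ETH_HLEN] & 0x0F) * 4            # IPv4 header length
    proto = pkt[ETH_HLEN + 9]
    if proto != IPPROTO_UDP or len(pkt) < ETH_HLEN + ihl + 8:
        return "stack"                          # not UDP, or truncated
    dport = int.from_bytes(pkt[ETH_HLEN + ihl + 2:ETH_HLEN + ihl + 4], "big")
    return "userspace" if dport in zc_ports else "stack"
```

The important property is that the choice is made from the raw packet bytes alone, which is why it can be taken this early.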
Re: [flamebait] xdp, well meaning but pointless
On 16-12-03 08:19 AM, Willem de Bruijn wrote:
> On Fri, Dec 2, 2016 at 12:22 PM, Jesper Dangaard Brouer wrote:
>>
>> On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal wrote:
>>
>>> In light of DPDK's existence it makes a lot more sense to me to provide a) a faster mmap based interface (possibly AF_PACKET based) that allows mapping the NIC directly into userspace, detaching tx/rx queues from the kernel.
>>>
>>> John Fastabend sent something like this last year as a proof of concept; iirc it was rejected because register space got exposed directly to userspace. I think we should re-consider merging netmap (or something conceptually close to its design).
>>
>> I'm actually working in this direction, of zero-copy RX mapping packets into userspace. This work is mostly related to page_pool, and I only plan to use XDP as a filter for selecting packets going to userspace, as this choice needs to be taken very early.
>>
>> My design is here:
>>
>> https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
>>
>> This is mostly about changing the memory model in the drivers, to allow for safely mapping pages to userspace. (An efficient queue mechanism is not covered.)
>
> Virtio virtqueues are used in various other locations in the stack. With separate memory pools and send + completion descriptor rings, signal moderation, careful avoidance of cacheline bouncing, etc. these seem like a good opportunity for a TPACKET_V4 format.

FWIW, after we rejected exposing the register space to user space due to valid security issues, we fell back to using VFIO, which works nicely for mapping virtual functions into userspace and VMs. The main drawback is that user space has to manage the VF, but that is mostly a solved problem at this point. Deployment concerns aside.

There was a TPACKET_V4 version we had a prototype of that passed buffers down to the hardware to use with the dma engine.
This gives zero-copy but, same as VFs, requires the hardware to do all the steering of traffic and any expected policy in front of the application. Due to requiring user space to kick hardware and vice versa, though, it was somewhat slower, so I didn't finish it up. The kick was implemented as a syscall iirc. I can maybe look at it a bit more next week and see if it's worth reviving now in this context.

I don't think any of this requires page pools though. Or rather, the other way to look at it is that tpacket and vhost/virtio already know how to do page pools. One idea I've been playing around with is a vhost backend using tpacketv{3|4} so we don't require socket manipulation.

Thanks, John
Re: [flamebait] xdp, well meaning but pointless
On Fri, Dec 2, 2016 at 12:22 PM, Jesper Dangaard Brouer wrote:
>
> On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal wrote:
>
>> In light of DPDK's existence it makes a lot more sense to me to provide a) a faster mmap based interface (possibly AF_PACKET based) that allows mapping the NIC directly into userspace, detaching tx/rx queues from the kernel.
>>
>> John Fastabend sent something like this last year as a proof of concept; iirc it was rejected because register space got exposed directly to userspace. I think we should re-consider merging netmap (or something conceptually close to its design).
>
> I'm actually working in this direction, of zero-copy RX mapping packets into userspace. This work is mostly related to page_pool, and I only plan to use XDP as a filter for selecting packets going to userspace, as this choice needs to be taken very early.
>
> My design is here:
>
> https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
>
> This is mostly about changing the memory model in the drivers, to allow for safely mapping pages to userspace. (An efficient queue mechanism is not covered.)

Virtio virtqueues are used in various other locations in the stack. With separate memory pools and send + completion descriptor rings, signal moderation, careful avoidance of cacheline bouncing, etc. these seem like a good opportunity for a TPACKET_V4 format.
Re: [flamebait] xdp, well meaning but pointless
On Fri, Dec 2, 2016 at 11:56 AM, Stephen Hemminger wrote:
> On Fri, 2 Dec 2016 19:12:00 +0100
> Hannes Frederic Sowa wrote:
>
>> On 02.12.2016 17:59, Tom Herbert wrote:
>> > On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa
>> > wrote:
>> >> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
>> >>> On Thu, 1 Dec 2016 13:51:32 -0800
>> >>> Tom Herbert wrote:
>> >>>
>> >> The technical plenary at the last IETF in Seoul a couple of weeks ago was exclusively focussed on DDOS in light of the recent attack against Dyn. There were speakers from Cloudflare and Dyn. The Cloudflare presentation by Nick Sullivan (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) alluded to some implementation of DDOS mitigation. In particular, on slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>> >>>
>> >>> slide 14
>> >>>
>> >> numbers he gave were based on iptables+BPF and that was a whole 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the mic and that's also when I introduced XDP to the whole IETF :-) ). If that's the best we can do, the Internet is in a world of hurt. DDOS mitigation alone is probably a sufficient motivation to look at XDP. We need something that drops bad packets as quickly as possible when under attack, we need this to be integrated into the stack, we need it to be programmable to deal with the increasing savvy of attackers, and we don't want to be forced to be dependent on HW solutions. This is why we created XDP!
>> >>>
>> >>> The 1.2Mpps number is a bit low, but we are unfortunately in that ballpark.
>> >>>
>> > I totally understand that.
>> > But in my reply to David in this thread I mentioned DNS apex processing as being problematic, which is actually referred to in your linked slide deck on page 9 ("What do floods look like"), and the problem of parsing DNS packets in XDP due to string processing and looping inside eBPF.
>> >>>
>> >>> That is a weak argument. You do realize CloudFlare actually use eBPF to do this exact filtering, and (so far) eBPF for parsing DNS has been sufficient for them.
>> >>
>> >> You are talking about this code on the following slides (I actually transcribed it for you here and disassembled):
>> >>
>> >> l0:  ld #0x14
>> >> l1:  ldxb 4*([0]&0xf)
>> >> l2:  add x
>> >> l3:  tax
>> >> l4:  ld [x+0]
>> >> l5:  jeq #0x7657861, l6, l13
>> >> l6:  ld [x+4]
>> >> l7:  jeq #0x6d706c65, l8, l13
>> >> l8:  ld [x+8]
>> >> l9:  jeq #0x3636f6d, l10, l13
>> >> l10: ldb [x+12]
>> >> l11: jeq #0, l12, l13
>> >> l12: ret #0x1
>> >> l13: ret #0
>> >>
>> >> You can offload this to u32 in hardware if that is what you want.
>> >>
>> >> The reason this works is because of netfilter, which allows them to dynamically generate BPF programs and insert and delete them from chains, and take intersections or unions of them.
>> >>
>> >> If you have a freestanding program like in XDP the complexity space is a different one and not comparable to this at all.
>> >>
>> > I don't understand this comment about complexity, especially in regards to the idea of offloading u32 to hardware. Relying on hardware to do anything always leads to more complexity than an equivalent SW implementation of the same functionality. The only reason we ever use a hardware mechanism is if it gives *significantly* better performance. If the performance difference isn't there then doing things in SW is going to be the better path (as we see in XDP).
>>
>> I am just wondering why the u32 filter wasn't mentioned in their slide deck.
>> If all Cloudflare needs are those kinds of matches, they are in fact actually easier to generate than a cBPF program. It is not a good example of what a real-world DoS filter in XDP would look like.
>>
>> If you argue XDP as a C function hook that can call arbitrary code in the driver before submitting that to the networking stack, yep, that is not complex at all. Depending on how those modules will be maintained, they either end up in the kernel and will be updated on major changes, or are 3rd party and people have to update them and also depend on the driver features.
>>
>> But this opens up a whole new can of worms also. I haven't really thought this through completely, but last time the patches were nack'ed with lots of strong opinions and I tended to agree with them. I am revisiting this position.
>>
>> Certainly you can build real-world DoS protection with this function pointer hook and C code in the driver. In this case a user space solution still has advantages because of maintainability.
Re: [flamebait] xdp, well meaning but pointless
On Fri, 2 Dec 2016 19:12:00 +0100 Hannes Frederic Sowa wrote:
> On 02.12.2016 17:59, Tom Herbert wrote:
> > On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa
> > wrote:
> >> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
> >>> On Thu, 1 Dec 2016 13:51:32 -0800
> >>> Tom Herbert wrote:
> >>>
> >> The technical plenary at the last IETF in Seoul a couple of weeks ago was exclusively focussed on DDOS in light of the recent attack against Dyn. There were speakers from Cloudflare and Dyn. The Cloudflare presentation by Nick Sullivan (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) alluded to some implementation of DDOS mitigation. In particular, on slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
> >>>
> >>> slide 14
> >>>
> >> numbers he gave were based on iptables+BPF and that was a whole 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the mic and that's also when I introduced XDP to the whole IETF :-) ). If that's the best we can do, the Internet is in a world of hurt. DDOS mitigation alone is probably a sufficient motivation to look at XDP. We need something that drops bad packets as quickly as possible when under attack, we need this to be integrated into the stack, we need it to be programmable to deal with the increasing savvy of attackers, and we don't want to be forced to be dependent on HW solutions. This is why we created XDP!
> >>>
> >>> The 1.2Mpps number is a bit low, but we are unfortunately in that ballpark.
> >>>
> > I totally understand that.
> > But in my reply to David in this thread I mentioned DNS apex processing as being problematic, which is actually referred to in your linked slide deck on page 9 ("What do floods look like"), and the problem of parsing DNS packets in XDP due to string processing and looping inside eBPF.
> >>>
> >>> That is a weak argument. You do realize CloudFlare actually use eBPF to do this exact filtering, and (so far) eBPF for parsing DNS has been sufficient for them.
> >>
> >> You are talking about this code on the following slides (I actually transcribed it for you here and disassembled):
> >>
> >> l0:  ld #0x14
> >> l1:  ldxb 4*([0]&0xf)
> >> l2:  add x
> >> l3:  tax
> >> l4:  ld [x+0]
> >> l5:  jeq #0x7657861, l6, l13
> >> l6:  ld [x+4]
> >> l7:  jeq #0x6d706c65, l8, l13
> >> l8:  ld [x+8]
> >> l9:  jeq #0x3636f6d, l10, l13
> >> l10: ldb [x+12]
> >> l11: jeq #0, l12, l13
> >> l12: ret #0x1
> >> l13: ret #0
> >>
> >> You can offload this to u32 in hardware if that is what you want.
> >>
> >> The reason this works is because of netfilter, which allows them to dynamically generate BPF programs and insert and delete them from chains, and take intersections or unions of them.
> >>
> >> If you have a freestanding program like in XDP the complexity space is a different one and not comparable to this at all.
> >>
> > I don't understand this comment about complexity, especially in regards to the idea of offloading u32 to hardware. Relying on hardware to do anything always leads to more complexity than an equivalent SW implementation of the same functionality. The only reason we ever use a hardware mechanism is if it gives *significantly* better performance. If the performance difference isn't there then doing things in SW is going to be the better path (as we see in XDP).
>
> I am just wondering why the u32 filter wasn't mentioned in their slide deck.
> If all Cloudflare needs are those kinds of matches, they are in fact actually easier to generate than a cBPF program. It is not a good example of what a real-world DoS filter in XDP would look like.
>
> If you argue XDP as a C function hook that can call arbitrary code in the driver before submitting that to the networking stack, yep, that is not complex at all. Depending on how those modules will be maintained, they either end up in the kernel and will be updated on major changes, or are 3rd party and people have to update them and also depend on the driver features.
>
> But this opens up a whole new can of worms also. I haven't really thought this through completely, but last time the patches were nack'ed with lots of strong opinions and I tended to agree with them. I am revisiting this position.
>
> Certainly you can build real-world DoS protection with this function pointer hook and C code in the driver. In this case a user space solution still has advantages because of maintainability, as e.g. with netmap or dpdk you are again decoupled from the in-kernel API/ABI and don't need to test, recompile etc. on each kernel upgrade.
Re: [flamebait] xdp, well meaning but pointless
On 02.12.2016 17:59, Tom Herbert wrote:
> On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa
> wrote:
>> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
>>> On Thu, 1 Dec 2016 13:51:32 -0800
>>> Tom Herbert wrote:
>>>
>> The technical plenary at the last IETF in Seoul a couple of weeks ago was exclusively focussed on DDOS in light of the recent attack against Dyn. There were speakers from Cloudflare and Dyn. The Cloudflare presentation by Nick Sullivan (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) alluded to some implementation of DDOS mitigation. In particular, on slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>>>
>>> slide 14
>>>
>> numbers he gave were based on iptables+BPF and that was a whole 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the mic and that's also when I introduced XDP to the whole IETF :-) ). If that's the best we can do, the Internet is in a world of hurt. DDOS mitigation alone is probably a sufficient motivation to look at XDP. We need something that drops bad packets as quickly as possible when under attack, we need this to be integrated into the stack, we need it to be programmable to deal with the increasing savvy of attackers, and we don't want to be forced to be dependent on HW solutions. This is why we created XDP!
>>>
>>> The 1.2Mpps number is a bit low, but we are unfortunately in that ballpark.
>>>
> I totally understand that. But in my reply to David in this thread I mentioned DNS apex processing as being problematic, which is actually referred to in your linked slide deck on page 9 ("What do floods look like"), and the problem of parsing DNS packets in XDP due to string processing and looping inside eBPF.
>>>
>>> That is a weak argument.
>>> You do realize CloudFlare actually use eBPF to do this exact filtering, and (so far) eBPF for parsing DNS has been sufficient for them.
>>
>> You are talking about this code on the following slides (I actually transcribed it for you here and disassembled):
>>
>> l0:  ld #0x14
>> l1:  ldxb 4*([0]&0xf)
>> l2:  add x
>> l3:  tax
>> l4:  ld [x+0]
>> l5:  jeq #0x7657861, l6, l13
>> l6:  ld [x+4]
>> l7:  jeq #0x6d706c65, l8, l13
>> l8:  ld [x+8]
>> l9:  jeq #0x3636f6d, l10, l13
>> l10: ldb [x+12]
>> l11: jeq #0, l12, l13
>> l12: ret #0x1
>> l13: ret #0
>>
>> You can offload this to u32 in hardware if that is what you want.
>>
>> The reason this works is because of netfilter, which allows them to dynamically generate BPF programs and insert and delete them from chains, and take intersections or unions of them.
>>
>> If you have a freestanding program like in XDP the complexity space is a different one and not comparable to this at all.
>>
> I don't understand this comment about complexity, especially in regards to the idea of offloading u32 to hardware. Relying on hardware to do anything always leads to more complexity than an equivalent SW implementation of the same functionality. The only reason we ever use a hardware mechanism is if it gives *significantly* better performance. If the performance difference isn't there then doing things in SW is going to be the better path (as we see in XDP).

I am just wondering why the u32 filter wasn't mentioned in their slide deck. If all Cloudflare needs are those kinds of matches, they are in fact actually easier to generate than a cBPF program. It is not a good example of what a real-world DoS filter in XDP would look like.

If you argue XDP as a C function hook that can call arbitrary code in the driver before submitting that to the networking stack, yep, that is not complex at all.
Depending on how those modules will be maintained, they either end up in the kernel and will be updated on major changes, or are 3rd party and people have to update them and also depend on the driver features.

But this opens up a whole new can of worms also. I haven't really thought this through completely, but last time the patches were nack'ed with lots of strong opinions and I tended to agree with them. I am revisiting this position.

Certainly you can build real-world DoS protection with this function pointer hook and C code in the driver. In this case a user space solution still has advantages because of maintainability, as e.g. with netmap or dpdk you are again decoupled from the in-kernel API/ABI and don't need to test, recompile etc. on each kernel upgrade. If the module ends up in the kernel, those problems might also disappear.

For XDP+eBPF to provide a full DoS mitigation (protocol parsing, sampling and dropping) solution seems too complex to me because of the arguments I stated in my previous mail.
Re: [flamebait] xdp, well meaning but pointless
On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal wrote:
> In light of DPDK's existence it makes a lot more sense to me to provide a) a faster mmap based interface (possibly AF_PACKET based) that allows mapping the NIC directly into userspace, detaching tx/rx queues from the kernel.
>
> John Fastabend sent something like this last year as a proof of concept; iirc it was rejected because register space got exposed directly to userspace. I think we should re-consider merging netmap (or something conceptually close to its design).

I'm actually working in this direction, of zero-copy RX mapping packets into userspace. This work is mostly related to page_pool, and I only plan to use XDP as a filter for selecting packets going to userspace, as this choice needs to be taken very early.

My design is here:

https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

This is mostly about changing the memory model in the drivers, to allow for safely mapping pages to userspace. (An efficient queue mechanism is not covered.)

People often overlook that netmap's efficiency *also* comes from pre-mapping memory/pages to userspace.

--
Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
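The pre-mapping point can be made concrete with a toy model: all packet buffers are allocated (and, in netmap's case, mapped into userspace) once at setup time, and the per-packet exchange is only a small (index, length) descriptor. The following Python simulation is purely illustrative; the class and method names are invented and stand in for what would really be a shared-memory ring between driver and application.

```python
class ZeroCopyRing:
    """Toy RX descriptor ring in the netmap/page_pool spirit.

    Buffers exist for the lifetime of the ring; per packet, the
    producer (driver side) only publishes a descriptor, and the
    consumer (application side) reads the pre-mapped buffer. No
    per-packet allocation ever happens.
    """
    def __init__(self, nslots=8, bufsz=2048):
        self.buffers = [bytearray(bufsz) for _ in range(nslots)]  # one-time setup
        self.desc = [None] * nslots   # (buffer index, length) slots
        self.head = 0                 # producer index
        self.tail = 0                 # consumer index
        self.nslots = nslots

    def produce(self, data: bytes) -> bool:
        nxt = (self.head + 1) % self.nslots
        if nxt == self.tail:
            return False              # ring full: drop, still no allocation
        self.buffers[self.head][:len(data)] = data
        self.desc[self.head] = (self.head, len(data))
        self.head = nxt
        return True

    def consume(self):
        if self.tail == self.head:
            return None               # ring empty
        idx, length = self.desc[self.tail]
        pkt = bytes(self.buffers[idx][:length])  # real code: zero-copy view
        self.tail = (self.tail + 1) % self.nslots
        return pkt
```

The copy in consume() is only there to make the toy safe; in the real design the application reads the mapped page directly, which is exactly the part the "read-only packet page" driver mode is meant to make safe.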
Re: [flamebait] xdp, well meaning but pointless
On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa wrote:
> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
>> On Thu, 1 Dec 2016 13:51:32 -0800
>> Tom Herbert wrote:
>>
> The technical plenary at the last IETF in Seoul a couple of weeks ago was exclusively focussed on DDOS in light of the recent attack against Dyn. There were speakers from Cloudflare and Dyn. The Cloudflare presentation by Nick Sullivan (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) alluded to some implementation of DDOS mitigation. In particular, on slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>>
>> slide 14
>>
> numbers he gave were based on iptables+BPF and that was a whole 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the mic and that's also when I introduced XDP to the whole IETF :-) ). If that's the best we can do, the Internet is in a world of hurt. DDOS mitigation alone is probably a sufficient motivation to look at XDP. We need something that drops bad packets as quickly as possible when under attack, we need this to be integrated into the stack, we need it to be programmable to deal with the increasing savvy of attackers, and we don't want to be forced to be dependent on HW solutions. This is why we created XDP!
>>
>> The 1.2Mpps number is a bit low, but we are unfortunately in that ballpark.
>>
> I totally understand that. But in my reply to David in this thread I mentioned DNS apex processing as being problematic, which is actually referred to in your linked slide deck on page 9 ("What do floods look like"), and the problem of parsing DNS packets in XDP due to string processing and looping inside eBPF.
>>
>> That is a weak argument. You do realize CloudFlare actually use eBPF to do this exact filtering, and (so far) eBPF for parsing DNS has been sufficient for them.
>
> You are talking about this code on the following slides (I actually transcribed it for you here and disassembled):
>
> l0:  ld #0x14
> l1:  ldxb 4*([0]&0xf)
> l2:  add x
> l3:  tax
> l4:  ld [x+0]
> l5:  jeq #0x7657861, l6, l13
> l6:  ld [x+4]
> l7:  jeq #0x6d706c65, l8, l13
> l8:  ld [x+8]
> l9:  jeq #0x3636f6d, l10, l13
> l10: ldb [x+12]
> l11: jeq #0, l12, l13
> l12: ret #0x1
> l13: ret #0
>
> You can offload this to u32 in hardware if that is what you want.
>
> The reason this works is because of netfilter, which allows them to dynamically generate BPF programs and insert and delete them from chains, and take intersections or unions of them.
>
> If you have a freestanding program like in XDP the complexity space is a different one and not comparable to this at all.
>
I don't understand this comment about complexity, especially in regards to the idea of offloading u32 to hardware. Relying on hardware to do anything always leads to more complexity than an equivalent SW implementation of the same functionality. The only reason we ever use a hardware mechanism is if it gives *significantly* better performance. If the performance difference isn't there then doing things in SW is going to be the better path (as we see in XDP).

Tom

> Bye,
> Hannes
>
Re: [flamebait] xdp, well meaning but pointless
On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
> On Thu, 1 Dec 2016 13:51:32 -0800
> Tom Herbert wrote:
>
> The technical plenary at the last IETF in Seoul a couple of weeks ago was exclusively focussed on DDOS in light of the recent attack against Dyn. There were speakers from Cloudflare and Dyn. The Cloudflare presentation by Nick Sullivan (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) alluded to some implementation of DDOS mitigation. In particular, on slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>
> slide 14
>
> numbers he gave were based on iptables+BPF and that was a whole 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the mic and that's also when I introduced XDP to the whole IETF :-) ). If that's the best we can do, the Internet is in a world of hurt. DDOS mitigation alone is probably a sufficient motivation to look at XDP. We need something that drops bad packets as quickly as possible when under attack, we need this to be integrated into the stack, we need it to be programmable to deal with the increasing savvy of attackers, and we don't want to be forced to be dependent on HW solutions. This is why we created XDP!
>
> The 1.2Mpps number is a bit low, but we are unfortunately in that ballpark.
>
>>> I totally understand that. But in my reply to David in this thread I mentioned DNS apex processing as being problematic, which is actually referred to in your linked slide deck on page 9 ("What do floods look like"), and the problem of parsing DNS packets in XDP due to string processing and looping inside eBPF.
>
> That is a weak argument. You do realize CloudFlare actually use eBPF to do this exact filtering, and (so far) eBPF for parsing DNS has been sufficient for them.
You are talking about this code on the following slides (I actually transcribed it for you here and disassembled):

l0:  ld #0x14
l1:  ldxb 4*([0]&0xf)
l2:  add x
l3:  tax
l4:  ld [x+0]
l5:  jeq #0x7657861, l6, l13
l6:  ld [x+4]
l7:  jeq #0x6d706c65, l8, l13
l8:  ld [x+8]
l9:  jeq #0x3636f6d, l10, l13
l10: ldb [x+12]
l11: jeq #0, l12, l13
l12: ret #0x1
l13: ret #0

You can offload this to u32 in hardware if that is what you want.

The reason this works is because of netfilter, which allows them to dynamically generate BPF programs and insert and delete them from chains, and take intersections or unions of them.

If you have a freestanding program like in XDP the complexity space is a different one and not comparable to this at all.

Bye, Hannes
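Decoded, the filter above computes X = IP header length + 0x14 (8 bytes of UDP header plus 12 bytes of DNS header), i.e. the offset of the DNS question name within the IP packet, and then accepts only packets whose name is the wire-encoded "\x07example\x03com\x00" (0x07657861 is "\x07exa", 0x6d706c65 is "mple", 0x3636f6d is "\x03com", and the final byte must be 0). A line-for-line Python rendering of the same logic, for illustration only:

```python
def cbpf_dns_match(ip_pkt: bytes) -> int:
    """Python rendering of the cBPF filter above.

    ip_pkt starts at the IPv4 header, as cBPF sees the packet here.
    Returns 1 (accept) or 0 (drop), like the two ret instructions.
    """
    # l0-l3: A = 0x14; X = 4*(ip_pkt[0] & 0xf); A += X; X = A
    # => X = IHL + 20 = offset of the DNS question name (UDP 8 + DNS 12)
    x = (ip_pkt[0] & 0x0F) * 4 + 0x14
    # l4-l11: four loads compared against the wire-format name
    # 0x07657861 0x6d706c65 0x03636f6d 0x00  ==  "\x07example\x03com\x00"
    name = b"\x07example\x03com\x00"
    if ip_pkt[x:x + 13] == name:
        return 1    # l12: ret #0x1
    return 0        # l13: ret #0
```

So the "DNS parsing" here is really a fixed 13-byte compare at a computed offset, which is also why it maps so directly onto a u32 match.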
Re: [flamebait] xdp, well meaning but pointless
On Thu, 1 Dec 2016 13:51:32 -0800 Tom Herbert wrote:
> >> The technical plenary at the last IETF in Seoul a couple of weeks ago was exclusively focussed on DDOS in light of the recent attack against Dyn. There were speakers from Cloudflare and Dyn. The Cloudflare presentation by Nick Sullivan (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) alluded to some implementation of DDOS mitigation. In particular, on slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel" slide 14
> >> numbers he gave were based on iptables+BPF and that was a whole 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the mic and that's also when I introduced XDP to the whole IETF :-) ). If that's the best we can do, the Internet is in a world of hurt. DDOS mitigation alone is probably a sufficient motivation to look at XDP. We need something that drops bad packets as quickly as possible when under attack, we need this to be integrated into the stack, we need it to be programmable to deal with the increasing savvy of attackers, and we don't want to be forced to be dependent on HW solutions. This is why we created XDP!

The 1.2Mpps number is a bit low, but we are unfortunately in that ballpark.

> > I totally understand that. But in my reply to David in this thread I mentioned DNS apex processing as being problematic, which is actually referred to in your linked slide deck on page 9 ("What do floods look like"), and the problem of parsing DNS packets in XDP due to string processing and looping inside eBPF.

That is a weak argument. You do realize CloudFlare actually use eBPF to do this exact filtering, and (so far) eBPF for parsing DNS has been sufficient for them.

> I agree that eBPF is not going to be sufficient for everything we'll want to do.
Undoubtably, we'll continue see new addition of more > helpers to assist in processing, but at some point we will want a to > load a kernel module that handles more complex processing and insert > it at the XDP callout. Nothing in the design of XDP precludes doing > that and I have already posted the patches to generalize the XDP > callout for that. Taking either of these routes has tradeoffs, but > regardless of whether this is BPF or module code, the principles of > XDP and its value to help solve some class of problems remains. As I've said before, I do support Tom's patches for a more generic XDP hook that the kernel itself can use. The first thing I would implement with this is a fast-path for Linux L2 bridging (it does depend on multiport TX support). It would be so easy to speed up bridging: XDP would only need to forward packets already in the bridge-FIB table; the rest is XDP_PASS to the normal stack and bridge code (timers etc). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
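[Jesper's bridge fast-path idea reduces to one FDB lookup per frame: forward known unicast, punt everything else to the stack. A minimal userspace sketch of that decision follows; the table layout, the linear-scan lookup, and the verdict names are illustrative stand-ins for the bridge FDB and the real XDP return codes, not kernel code.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of an XDP-style L2 bridge fast path:
 * forward only when the destination MAC is already known to the
 * bridge FIB; otherwise fall back to the normal stack so learning,
 * flooding and timers keep working. */

enum verdict { VERDICT_PASS, VERDICT_TX };

struct fib_entry {
    uint8_t mac[6];
    int     port;      /* egress port learned by the bridge */
    int     valid;
};

#define FIB_SIZE 64

/* A linear scan stands in for the bridge FDB hash lookup. */
static int fib_lookup(const struct fib_entry *fib,
                      const uint8_t *dmac, int *port)
{
    for (int i = 0; i < FIB_SIZE; i++) {
        if (fib[i].valid && memcmp(fib[i].mac, dmac, 6) == 0) {
            *port = fib[i].port;
            return 1;
        }
    }
    return 0;
}

/* Known unicast is forwarded; unknown, multicast and broadcast
 * frames take the slow path (XDP_PASS in a real program). */
enum verdict bridge_fast_path(const struct fib_entry *fib,
                              const uint8_t *frame, int *out_port)
{
    const uint8_t *dmac = frame;    /* dst MAC is the first 6 bytes */

    if (dmac[0] & 1)                /* multicast/broadcast bit set */
        return VERDICT_PASS;
    if (fib_lookup(fib, dmac, out_port))
        return VERDICT_TX;
    return VERDICT_PASS;
}
```

[A real implementation would replace the scan with a BPF map lookup and need the multiport-TX support Jesper mentions; the sketch only shows why the fast path is cheap: one bounded lookup, no skb allocation.]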
Re: [flamebait] xdp, well meaning but pointless
On Thu, Dec 1, 2016 at 1:27 PM, Hannes Frederic Sowawrote: > On 01.12.2016 22:12, Tom Herbert wrote: >> On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa >> wrote: >>> Hello, >>> >>> this is a good conversation and I simply want to bring my worries >>> across. I don't have good solutions for the problems XDP tries to solve >>> but I fear we could get caught up in maintenance problems in the long >>> term given the ideas floating around on how to evolve XDP currently. >>> >>> On 01.12.2016 17:28, Thomas Graf wrote: On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote: > First of all, this is a rant targeted at XDP and not at eBPF as a whole. > XDP manipulates packets at free will and thus all security guarantees > are off as well as in any user space solution. > > Secondly user space provides policy, acl, more controlled memory > protection, restartability and better debugability. If I had multi > tenant workloads I would definitely put more complex "business/acl" > logic into user space, so I can make use of LSM and other features to > especially prevent a network facing service to attack the tenants. If > stuff gets put into the kernel you run user controlled code in the > kernel exposing a much bigger attack vector. > > What use case do you see in XDP specifically e.g. for container > networking? DDOS mitigation to protect distributed applications in large clusters. Relying on CDN works to protect API gateways and frontends (as long as they don't throw you out of their network) but offers no protection beyond that, e.g. a noisy/hostile neighbour. Doing this at the server level and allowing the mitigation capability to scale up with the number of servers is natural and cheap. >>> >>> So far we e.g. always considered L2 attacks a problem of the network >>> admin to correctly protect the environment. Are you talking about >>> protecting the L3 data plane? 
Are there custom proprietary protocols in >>> place which need custom protocol parsers that need involvement of the >>> kernel before it could verify the packet? >>> >>> In the past we tried to protect the L3 data plane as good as we can in >>> Linux to allow the plain old server admin to set an IP address on an >>> interface and install whatever software in user space. We try not only >>> to protect it but also try to achieve fairness by adding a lot of >>> counters everywhere. Are protections missing right now or are we talking >>> about better performance? >>> >> The technical plenary at last IETF on Seoul a couple of weeks ago was >> exclusively focussed on DDOS in light of the recent attack against >> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare >> presentation by Nick Sullivan >> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) >> alluded to some implementation of DDOS mitigation. In particular, on >> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel" >> numbers he gave we're based in iptables+BPF and that was a whole >> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic >> and that's also when I introduced XDP to whole IETF :-) ). If that's >> the best we can do the Internet is in a world hurt. DDOS mitigation >> alone is probably a sufficient motivation to look at XDP. We need >> something that drops bad packets as quickly as possible when under >> attack, we need this to be integrated into the stack, we need it to be >> programmable to deal with the increasing savvy of attackers, and we >> don't want to be forced to be dependent on HW solutions. This is why >> we created XDP! > > I totally understand that. 
But in my reply to David in this thread I > mentioned DNS apex processing as being problematic which is actually > being referred in your linked slide deck on page 9 ("What do floods look > like") and the problematic of parsing DNS packets in XDP due to string > processing and looping inside eBPF. > I agree that eBPF is not going to be sufficient from everything we'll want to do. Undoubtably, we'll continue see new addition of more helpers to assist in processing, but at some point we will want a to load a kernel module that handles more complex processing and insert it at the XDP callout. Nothing in the design of XDP precludes doing that and I have already posted the patches to generalize the XDP callout for that. Taking either of these routes has tradeoffs, but regardless of whether this is BPF or module code, the principles of XDP and its value to help solve some class of problems remains. Tom > Not to mention the fact that you might have to deal with fragments in > the Internet. Some DOS mitigations were already abused to generate > blackholes for other users. Filtering such stuff is quite complicated. > > I argued also
Re: [flamebait] xdp, well meaning but pointless
On 01.12.2016 22:12, Tom Herbert wrote: > On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa >wrote: >> Hello, >> >> this is a good conversation and I simply want to bring my worries >> across. I don't have good solutions for the problems XDP tries to solve >> but I fear we could get caught up in maintenance problems in the long >> term given the ideas floating around on how to evolve XDP currently. >> >> On 01.12.2016 17:28, Thomas Graf wrote: >>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote: First of all, this is a rant targeted at XDP and not at eBPF as a whole. XDP manipulates packets at free will and thus all security guarantees are off as well as in any user space solution. Secondly user space provides policy, acl, more controlled memory protection, restartability and better debugability. If I had multi tenant workloads I would definitely put more complex "business/acl" logic into user space, so I can make use of LSM and other features to especially prevent a network facing service to attack the tenants. If stuff gets put into the kernel you run user controlled code in the kernel exposing a much bigger attack vector. What use case do you see in XDP specifically e.g. for container networking? >>> >>> DDOS mitigation to protect distributed applications in large clusters. >>> Relying on CDN works to protect API gateways and frontends (as long as >>> they don't throw you out of their network) but offers no protection >>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server >>> level and allowing the mitigation capability to scale up with the number >>> of servers is natural and cheap. >> >> So far we e.g. always considered L2 attacks a problem of the network >> admin to correctly protect the environment. Are you talking about >> protecting the L3 data plane? Are there custom proprietary protocols in >> place which need custom protocol parsers that need involvement of the >> kernel before it could verify the packet? 
>> >> In the past we tried to protect the L3 data plane as good as we can in >> Linux to allow the plain old server admin to set an IP address on an >> interface and install whatever software in user space. We try not only >> to protect it but also try to achieve fairness by adding a lot of >> counters everywhere. Are protections missing right now or are we talking >> about better performance? >> > The technical plenary at last IETF on Seoul a couple of weeks ago was > exclusively focussed on DDOS in light of the recent attack against > Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare > presentation by Nick Sullivan > (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) > alluded to some implementation of DDOS mitigation. In particular, on > slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel" > numbers he gave we're based in iptables+BPF and that was a whole > 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic > and that's also when I introduced XDP to whole IETF :-) ). If that's > the best we can do the Internet is in a world hurt. DDOS mitigation > alone is probably a sufficient motivation to look at XDP. We need > something that drops bad packets as quickly as possible when under > attack, we need this to be integrated into the stack, we need it to be > programmable to deal with the increasing savvy of attackers, and we > don't want to be forced to be dependent on HW solutions. This is why > we created XDP! I totally understand that. But in my reply to David in this thread I mentioned DNS apex processing as being problematic which is actually being referred in your linked slide deck on page 9 ("What do floods look like") and the problematic of parsing DNS packets in XDP due to string processing and looping inside eBPF. Not to mention the fact that you might have to deal with fragments in the Internet. 
Some DOS mitigations were already abused to generate blackholes for other users. Filtering such stuff is quite complicated. I argued also under the aspect of what Thomas said, that the outside world of the cluster is already protected by a CDN. Bye, Hannes
Re: [flamebait] xdp, well meaning but pointless
On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowawrote: > Hello, > > this is a good conversation and I simply want to bring my worries > across. I don't have good solutions for the problems XDP tries to solve > but I fear we could get caught up in maintenance problems in the long > term given the ideas floating around on how to evolve XDP currently. > > On 01.12.2016 17:28, Thomas Graf wrote: >> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote: >>> First of all, this is a rant targeted at XDP and not at eBPF as a whole. >>> XDP manipulates packets at free will and thus all security guarantees >>> are off as well as in any user space solution. >>> >>> Secondly user space provides policy, acl, more controlled memory >>> protection, restartability and better debugability. If I had multi >>> tenant workloads I would definitely put more complex "business/acl" >>> logic into user space, so I can make use of LSM and other features to >>> especially prevent a network facing service to attack the tenants. If >>> stuff gets put into the kernel you run user controlled code in the >>> kernel exposing a much bigger attack vector. >>> >>> What use case do you see in XDP specifically e.g. for container networking? >> >> DDOS mitigation to protect distributed applications in large clusters. >> Relying on CDN works to protect API gateways and frontends (as long as >> they don't throw you out of their network) but offers no protection >> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server >> level and allowing the mitigation capability to scale up with the number >> of servers is natural and cheap. > > So far we e.g. always considered L2 attacks a problem of the network > admin to correctly protect the environment. Are you talking about > protecting the L3 data plane? Are there custom proprietary protocols in > place which need custom protocol parsers that need involvement of the > kernel before it could verify the packet? 
> > In the past we tried to protect the L3 data plane as good as we can in > Linux to allow the plain old server admin to set an IP address on an > interface and install whatever software in user space. We try not only > to protect it but also try to achieve fairness by adding a lot of > counters everywhere. Are protections missing right now or are we talking > about better performance? > The technical plenary at last IETF on Seoul a couple of weeks ago was exclusively focussed on DDOS in light of the recent attack against Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare presentation by Nick Sullivan (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) alluded to some implementation of DDOS mitigation. In particular, on slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel" numbers he gave we're based in iptables+BPF and that was a whole 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic and that's also when I introduced XDP to whole IETF :-) ). If that's the best we can do the Internet is in a world hurt. DDOS mitigation alone is probably a sufficient motivation to look at XDP. We need something that drops bad packets as quickly as possible when under attack, we need this to be integrated into the stack, we need it to be programmable to deal with the increasing savvy of attackers, and we don't want to be forced to be dependent on HW solutions. This is why we created XDP! Tom > To provide fairness you often have to share validated data within the > kernel and with XDP. This requires consistent lookup methods for sockets > in the lower level. Those can be exported to XDP via external functions > and become part of uAPI which will limit our ability to change those > functions in future. 
When the discussion started about early demuxing in > XDP I became really nervous, because suddenly the XDP program has to > decide correctly which protocol type it has and look in the correct > socket table for the socket. Different semantics for sockets can apply > here, e.g. some sockets are RCU managed, some end up using reference > counts. A wrong decision here would cause havoc in the kernel (XDP > considers packet as UDP but kernel stack as TCP). Also, who knows that > we won't have per-cpu socket tables we would keep that as uAPI (this is > btw. the dragonflyBSD approach to scaling)? Imagine someone writing a > SIP rewriter in XDP and depending on a coherent view of all sockets even > if their hash doesn't fit to the one of the queue? Suddenly something > which was thought of as being only mutable by one CPU becomes global > again and because of XDP we need to add locking because of uAPI. > > This discussion is parallel to the discussion about trace points, which > are not considered uAPI. If eBPF functions are not considered uAPI then > eBPF in the network stack will have much less value, because you > suddenly
Re: [flamebait] xdp, well meaning but pointless
Hello, this is a good conversation and I simply want to bring my worries across. I don't have good solutions for the problems XDP tries to solve but I fear we could get caught up in maintenance problems in the long term given the ideas floating around on how to evolve XDP currently. On 01.12.2016 17:28, Thomas Graf wrote: > On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote: >> First of all, this is a rant targeted at XDP and not at eBPF as a whole. >> XDP manipulates packets at free will and thus all security guarantees >> are off as well as in any user space solution. >> >> Secondly user space provides policy, acl, more controlled memory >> protection, restartability and better debugability. If I had multi >> tenant workloads I would definitely put more complex "business/acl" >> logic into user space, so I can make use of LSM and other features to >> especially prevent a network facing service to attack the tenants. If >> stuff gets put into the kernel you run user controlled code in the >> kernel exposing a much bigger attack vector. >> >> What use case do you see in XDP specifically e.g. for container networking? > > DDOS mitigation to protect distributed applications in large clusters. > Relying on CDN works to protect API gateways and frontends (as long as > they don't throw you out of their network) but offers no protection > beyond that, e.g. a noisy/hostile neighbour. Doing this at the server > level and allowing the mitigation capability to scale up with the number > of servers is natural and cheap. So far we e.g. always considered L2 attacks a problem of the network admin to correctly protect the environment. Are you talking about protecting the L3 data plane? Are there custom proprietary protocols in place which need custom protocol parsers that need involvement of the kernel before it could verify the packet? 
In the past we tried to protect the L3 data plane as good as we can in Linux to allow the plain old server admin to set an IP address on an interface and install whatever software in user space. We try not only to protect it but also try to achieve fairness by adding a lot of counters everywhere. Are protections missing right now or are we talking about better performance? To provide fairness you often have to share validated data within the kernel and with XDP. This requires consistent lookup methods for sockets in the lower level. Those can be exported to XDP via external functions and become part of uAPI which will limit our ability to change those functions in future. When the discussion started about early demuxing in XDP I became really nervous, because suddenly the XDP program has to decide correctly which protocol type it has and look in the correct socket table for the socket. Different semantics for sockets can apply here, e.g. some sockets are RCU managed, some end up using reference counts. A wrong decision here would cause havoc in the kernel (XDP considers packet as UDP but kernel stack as TCP). Also, who knows that we won't have per-cpu socket tables we would keep that as uAPI (this is btw. the dragonflyBSD approach to scaling)? Imagine someone writing a SIP rewriter in XDP and depending on a coherent view of all sockets even if their hash doesn't fit to the one of the queue? Suddenly something which was thought of as being only mutable by one CPU becomes global again and because of XDP we need to add locking because of uAPI. This discussion is parallel to the discussion about trace points, which are not considered uAPI. If eBPF functions are not considered uAPI then eBPF in the network stack will have much less value, because you suddenly depend on specific kernel versions again and cannot simply load the code into the kernel. The API checks will become very difficult to implement, see also the ongoing MODVERSIONS discussions on LKML some days back. 
>>> I agree with you if the LB is a software based appliance in either a >>> dedicated VM or on dedicated baremetal. >>> >>> The reality is turning out to be different in many cases though, LB >>> needs to be performed not only for north south but east west as well. >>> So even if I would handle LB for traffic entering my datacenter in user >>> space, I will need the same LB for packets from my applications and >>> I definitely don't want to move all of that into user space. >> >> The open question to me is why is programmability needed here. >> >> Look at the discussion about ECMP and consistent hashing. It is not very >> easy to actually write this code correctly. Why can't we just put C code >> into the kernel that implements this once and for all and let user space >> update the policies? > > Whatever LB logic is put in place with native C code now is unlikely the > logic we need in two years. We can't really predict the future. If it > was the case, networking would have been done long ago and we would all > be working on self eating ice cream now. Did LB algorithms on the networking layer change that much?
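[Hannes asks why the consistent-hashing part of an LB can't be written once in C. The catch Thomas raises is that the selection policy keeps changing. For concreteness, here is one common flavour of the problem as a userspace sketch: rendezvous (highest-random-weight) hashing, where removing a backend remaps only the flows that were on it. The function names and the mixing function are ours, not from any code under discussion.]

```c
#include <assert.h>
#include <stdint.h>

/* splitmix64 finalizer used as a stand-in flow hash. */
static uint64_t mix64(uint64_t x)
{
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* Rendezvous hashing: pick the live backend with the highest
 * hash(flow, backend). Because each flow's winner among the
 * survivors is unchanged when some other backend dies, only the
 * dead backend's flows get remapped -- the property LB people
 * actually want from "consistent" hashing. */
int select_backend(uint64_t flow_key, const int *alive, int nbackends)
{
    int best = -1;
    uint64_t best_w = 0;

    for (int i = 0; i < nbackends; i++) {
        if (!alive[i])
            continue;
        uint64_t w = mix64(flow_key ^ (((uint64_t)i << 32) | 0x5bd1e995u));
        if (best < 0 || w > best_w) {
            best = i;
            best_w = w;
        }
    }
    return best;
}
```

[Whether this lives in fixed kernel C with userspace-supplied policy (Hannes's preference) or in a BPF program (the XDP position) is exactly the disagreement in the thread; the algorithm itself fits either.]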
Re: [flamebait] xdp, well meaning but pointless
On Thu, Dec 1, 2016 at 10:01 AM, Tom Herbertwrote: > > > On Thu, Dec 1, 2016 at 1:11 AM, Florian Westphal wrote: >> >> [ As already mentioned in my reply to Tom, here is >> the xdp flamebait/critique ] >> >> Lots of XDP related patches started to appear on netdev. >> I'd prefer if it would stop... >> >> To me XDP combines all disadvantages of stack bypass solutions like dpdk >> with the disadvantages of kernel programming with a more limited >> instruction set and toolchain. >> >> Unlike XDP userspace bypass (dpdk et al) allow use of any programming >> model or language you want (including scripting languages), which >> makes things a lot easier, e.g. garbage collection, debuggers vs. >> crash+vmcore+printk... >> >> I have heared the argument that these restrictions that come with >> XDP are great because it allows to 'limit what users can do'. >> >> Given existence of DPDK/netmap/userspace bypass is a reality, this is >> a very weak argument -- why would anyone pick XDP over a dpdk/netmap >> based solution? > Because, we've seen time an time again that attempts to bypass the stack and run parallel stacks under the banner of "the kernel is too slow" does not scale for large deployment. We've seen this with RDMA, TOE, OpenOnload, and we'll see this for DPDK, FD.io, VPP and whatever else people are going to dream up. If I have a couple hundred machines running a single application like the HFT guys do, then sure I'd probably look into such solutions. But when I have datacenters with 100Ks running an assortment of applications even contemplating the possibility of deploying a parallel stacks gives me headache. We need to consider an seemingly endless list of security issues, manageability. robustness, protocol compatibility, etc. 
I really have little interest in bringing a huge pile of 3rd party code that I have to support, and I definitely have no interest in constantly replacing all of my hardware to get the latest and greatest support for these offloads as vendors leak them out. Given a choice between buying into some kernel bypass solution versus hacking Linux a little bit to carve out an accelerated data path to address the "kernel is too slow" argument, I will choose the latter any day of the week. Tom > > Because, we've seen time an time again that attempts to bypass the stack and > run parallel stacks under the banner of "the kernel is too slow" does not > scale for large deployment. We've seen this with RDMA, TOE, OpenOnload, and > we'll see this for DPDK, FD.io, VPP and whatever else people are going to > dream up. If I have a couple hundred machines running a single application > like the HFT guys do, then sure I'd probably look into such solutions. But > when I have datacenters with 100Ks running an assortment of applications > even contemplating the possibility of deploying a parallel stacks gives me > headache. We need to consider an seemingly endless list of security issues, > manageability. robustness, protocol compatibility, etc. I really have little > interest in bringing a huge pile of 3rd party code that I have to support, > and I definitely have no interest in constantly replacing all of my hardware > to get the latest and greatest support for these offloads as vendors leak > them out. Given a choice between buying into some kernel bypass solution > versus hacking Linux a little bit to carve out an accelerated data path to > address the "kernel is too slow" argument, I will choose the latter any day > of the week. > > Tom > >> XDP will always be less powerful and a lot more complicated, >> especially considering users of dpdk (or toolkits built on top of it) >> are not kernel programmers and userspace has more powerful ipc >> (or storage) mechanisms. 
>> >> Aside from this, XDP, like DPDK, is a kernel bypass. >> You might say 'Its just stack bypass, not a kernel bypass!'. >> But what does that mean exactly? That packets can still be passed >> onward to normal stack? >> Bypass solutions like netmap can also inject packets back to >> kernel stack again. >> >> Running less powerful user code in a restricted environment in the kernel >> address space is certainly a worse idea than separating this logic out >> to user space. >> >> In light of DPDKs existence it make a lot more sense to me to provide >> a). a faster mmap based interface (possibly AF_PACKET based) that allows >> to map nic directly into userspace, detaching tx/rx queue from kernel. >> >> John Fastabend sent something like this last year as a proof of >> concept, iirc it was rejected because register space got exposed directly >> to userspace. I think we should re-consider merging netmap >> (or something conceptually close to its design). >> >> b). with regards to a programmable data path: IFF one wants to do this >> in kernel (and thats a big if), it seems much more preferrable to provide >> a config/data-based approach rather than a programmable one. If you want >> full freedom DPDK
Re: [flamebait] xdp, well meaning but pointless
On 01.12.2016 17:19, David Miller wrote: > Saying that ntuple filters can handle the early drop use case doesn't > take into consideration the nature of the tables (hundreds of > thousands of "evil" IP addresses), whether hardware can actually > handle that (it can't), and whether simple IP address matching is the > full extent of it (it isn't). Yes, that is why you certainly use ntuple filters in combination with some kind of high level business logic in user space. I have to check but am pretty sure you can't even do the simplest thing in XDP, parsing the apexes of DNS packets and checking them against a hash table, because the program won't pass the verifier.
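[Hannes's verifier objection is concrete: in 2016 eBPF had no bounded loops, so a QNAME walk had to be fully unrolled with every access bounds-checked. The shape such a parser must take can be shown in plain C; MAX_LABELS and the function name are our illustrative choices, and whether this passes a given verifier version is exactly the open question Hannes raises.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compile-time label cap -- in an XDP program this loop would be
 * #pragma unroll'ed so the verifier sees straight-line code. */
#define MAX_LABELS 16

/* Walk a DNS QNAME starting at pkt[off]. Returns the encoded length
 * (including the terminating root byte), or -1 if the name is
 * truncated, uses compression pointers, or has too many labels.
 * Every access is bounds-checked against len, as the verifier
 * would require against data_end. */
int parse_qname(const uint8_t *pkt, size_t len, size_t off)
{
    size_t pos = off;

    for (int i = 0; i < MAX_LABELS; i++) {
        if (pos >= len)
            return -1;
        uint8_t lab = pkt[pos];
        if (lab == 0)                    /* root label: end of name */
            return (int)(pos - off + 1);
        if (lab & 0xc0)                  /* reject compression ptrs */
            return -1;
        if (pos + 1 + lab > len)         /* label must fit in packet */
            return -1;
        pos += 1 + lab;
    }
    return -1;                           /* name too deep for budget */
}
```

[The cap is the crux: a fixed unroll budget bounds both the verifier's state explosion and the per-packet work, but it also means names deeper than MAX_LABELS can only be punted to the stack, which is the expressiveness limit both sides are arguing about.]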
Re: [flamebait] xdp, well meaning but pointless
David Miller wrote: > Saying that ntuple filters can handle the early drop use case doesn't > take into consideration the nature of the tables (hundreds of > thousands of "evil" IP addresses), That's not what I said. But Ok, message received. I rest my case.
Re: [flamebait] xdp, well meaning but pointless
On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote: > First of all, this is a rant targeted at XDP and not at eBPF as a whole. > XDP manipulates packets at free will and thus all security guarantees > are off as well as in any user space solution. > > Secondly user space provides policy, acl, more controlled memory > protection, restartability and better debugability. If I had multi > tenant workloads I would definitely put more complex "business/acl" > logic into user space, so I can make use of LSM and other features to > especially prevent a network facing service to attack the tenants. If > stuff gets put into the kernel you run user controlled code in the > kernel exposing a much bigger attack vector. > > What use case do you see in XDP specifically e.g. for container networking? DDOS mitigation to protect distributed applications in large clusters. Relying on CDN works to protect API gateways and frontends (as long as they don't throw you out of their network) but offers no protection beyond that, e.g. a noisy/hostile neighbour. Doing this at the server level and allowing the mitigation capability to scale up with the number of servers is natural and cheap. > > I agree with you if the LB is a software based appliance in either a > > dedicated VM or on dedicated baremetal. > > > > The reality is turning out to be different in many cases though, LB > > needs to be performed not only for north south but east west as well. > > So even if I would handle LB for traffic entering my datacenter in user > > space, I will need the same LB for packets from my applications and > > I definitely don't want to move all of that into user space. > > The open question to me is why is programmability needed here. > > Look at the discussion about ECMP and consistent hashing. It is not very > easy to actually write this code correctly. Why can't we just put C code > into the kernel that implements this once and for all and let user space > update the policies? 
Whatever LB logic is put in place with native C code now is unlikely the logic we need in two years. We can't really predict the future. If it was the case, networking would have been done long ago and we would all be working on self eating ice cream now. > Load balancers have to deal correctly with ICMP packets, e.g. they even > have to be duplicated to every ECMP route. This seems to be problematic > to do in eBPF programs due to looping constructs so you end up with > complicated user space anyway. Feel free to implement such complex LBs in user space or natively. It is not required for the majority of use cases. The most popular LBs for application load balancing have no idea of ECMP and require ECMP aware routers to be made redundant itself.
Re: [flamebait] xdp, well meaning but pointless
From: Thomas Graf Date: Thu, 1 Dec 2016 15:58:34 +0100 > The benefits of XDP for this use case are extremely obvious in combination > with local applications which need to be protected. ntuple filters won't > cut it. They are limited and subject to a certain rate at which they > can be configured. Any serious mitigation will require stateful filtering > with at least minimal L7 matching abilities and this is exactly where XDP > will excel. +1 Saying that ntuple filters can handle the early drop use case doesn't take into consideration the nature of the tables (hundreds of thousands of "evil" IP addresses), whether hardware can actually handle that (it can't), and whether simple IP address matching is the full extent of it (it isn't). Most of the time when I hear anti-XDP rhetoric, it usually comes from a crowd who for some reason feels threatened by the technology and what it might replace and make useless. That to me says that we are _exactly_ going down the right path.
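[The scale argument here is that a blocklist of hundreds of thousands of addresses exceeds NIC ntuple filter capacity but is a trivial O(1) map lookup in software. A userspace sketch of that per-packet decision, with an open-addressed hash set standing in for the BPF hash map an XDP program would consult (all names and sizes are illustrative):]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Open-addressed set of IPv4 addresses; slot value 0 means empty,
 * so 0.0.0.0 is not representable in this toy version. */
struct blocklist {
    uint32_t *slots;
    size_t    mask;      /* table size - 1; size is a power of two */
};

static size_t iphash(uint32_t ip)
{
    ip *= 0x9e3779b1u;               /* Fibonacci hashing */
    return ip ^ (ip >> 16);
}

int blocklist_init(struct blocklist *bl, size_t pow2_size)
{
    bl->slots = calloc(pow2_size, sizeof(uint32_t));
    bl->mask = pow2_size - 1;
    return bl->slots ? 0 : -1;
}

void blocklist_add(struct blocklist *bl, uint32_t ip)
{
    size_t i = iphash(ip) & bl->mask;
    while (bl->slots[i] && bl->slots[i] != ip)   /* linear probing */
        i = (i + 1) & bl->mask;
    bl->slots[i] = ip;
}

/* The per-packet hot path: one lookup, then drop or pass --
 * the decision an XDP program maps to XDP_DROP / XDP_PASS. */
int should_drop(const struct blocklist *bl, uint32_t src_ip)
{
    size_t i = iphash(src_ip) & bl->mask;
    while (bl->slots[i]) {
        if (bl->slots[i] == src_ip)
            return 1;                            /* XDP_DROP */
        i = (i + 1) & bl->mask;
    }
    return 0;                                    /* XDP_PASS */
}
```

[This is also where Dave's second point bites: real mitigation needs more than exact-address matching (prefixes, rates, L7 state), which is why the table is only the starting point of the argument, not the whole of it.]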
Re: [flamebait] xdp, well meaning but pointless
Thomas Graf wrote: > On 12/01/16 at 10:11am, Florian Westphal wrote: > > Aside from this, XDP, like DPDK, is a kernel bypass. > > You might say 'Its just stack bypass, not a kernel bypass!'. > > But what does that mean exactly? That packets can still be passed > > onward to normal stack? > > Bypass solutions like netmap can also inject packets back to > > kernel stack again. > > I have a fundamental issue with the approach of exporting packets into > user space and reinjecting them: Once the packet leaves the kernel, > any security guarantees are off. I have no control over what is > running in user space and whether whatever listener up there has been > compromised or not. To me, that's a no go, in particular for servers > hosting multi tenant workloads. This is one of the main reasons why > XDP, in particular in combination with BPF, is very interesting to me. Funny, I see it exactly the other way around :) To me, a packet coming from this "userspace injection" is no different than a tun/tap, or any other packet coming from the network. I see no change or increase in attack surface.
Re: [flamebait] xdp, well meaning but pointless
Hi, On 01.12.2016 15:58, Thomas Graf wrote: > On 12/01/16 at 10:11am, Florian Westphal wrote: >> Aside from this, XDP, like DPDK, is a kernel bypass. >> You might say 'Its just stack bypass, not a kernel bypass!'. >> But what does that mean exactly? That packets can still be passed >> onward to normal stack? >> Bypass solutions like netmap can also inject packets back to >> kernel stack again. > > I have a fundamental issue with the approach of exporting packets into > user space and reinjecting them: Once the packet leaves the kernel, > any security guarantees are off. I have no control over what is > running in user space and whether whatever listener up there has been > compromised or not. To me, that's a no go, in particular for servers > hosting multi tenant workloads. This is one of the main reasons why > XDP, in particular in combination with BPF, is very interesting to me. First of all, this is a rant targeted at XDP and not at eBPF as a whole. XDP manipulates packets at free will and thus all security guarantees are off as well as in any user space solution. Secondly user space provides policy, acl, more controlled memory protection, restartability and better debugability. If I had multi tenant workloads I would definitely put more complex "business/acl" logic into user space, so I can make use of LSM and other features to especially prevent a network facing service to attack the tenants. If stuff gets put into the kernel you run user controlled code in the kernel exposing a much bigger attack vector. What use case do you see in XDP specifically e.g. for container networking? >> b). with regards to a programmable data path: IFF one wants to do this >> in kernel (and thats a big if), it seems much more preferrable to provide >> a config/data-based approach rather than a programmable one. If you want >> full freedom DPDK is architecturally just too powerful to compete with. 
> I must have missed the legal disclaimer that is usually put in front
> of the DPDK marketing show :-)
>
> I don't want full freedom. I want programmability with stack
> integration at sufficient speed and the ability to benefit from the
> hardware abstractions that the kernel provides.
>
>> Proponents of XDP sometimes provide usage examples.
>> Let's look at some of these.
>
> [ I won't comment on any of the other use cases because they are of no
>   interest to me ]
>
>> * Load balancer
>>   State holding algorithms need sorting and searching, so also no fit
>>   for eBPF (could be exposed by function exports, but then can we do
>>   DoS by finding worst case scenarios?).
>>
>>   Also again needs a way to forward the frame out via another
>>   interface.
>>
>>   For cases where the packet gets sent out via the same interface it
>>   would appear to be easier to use port mirroring in a switch and use
>>   stochastic filtering on end nodes to determine which host should
>>   take responsibility.
>>
>>   XDP plus: central authority over how distribution will work in case
>>   nodes are added/removed from the pool. But then again, it will be
>>   easier to handle this with netmap/dpdk where more complicated
>>   scheduling algorithms can be used.
>
> I agree with you if the LB is a software based appliance in either a
> dedicated VM or on dedicated baremetal.
>
> The reality is turning out to be different in many cases though: LB
> needs to be performed not only north-south but east-west as well. So
> even if I handled LB for traffic entering my datacenter in user space,
> I would still need the same LB for packets from my applications, and I
> definitely don't want to move all of that into user space.

The open question to me is why programmability is needed here. Look at
the discussion about ECMP and consistent hashing: it is not easy to
write this code correctly. Why can't we just put C code into the kernel
that implements this once and for all, and let user space update the
policies?
Load balancers have to deal correctly with ICMP packets, e.g. they may
even have to be duplicated to every ECMP route. This seems problematic
to do in eBPF programs due to the lack of looping constructs, so you
end up with complicated user space anyway.

>> * early drop/filtering.
>>   While it's possible to do "u32"-like filters with ebpf, all modern
>>   nics support ntuple filtering in hardware, which is going to be
>>   faster because such a packet will never even be signalled to the
>>   operating system. For more complicated cases (e.g. doing a socket
>>   lookup to check if a particular packet matches a bound socket (and
>>   expected sequence numbers etc.)) I don't see easy ways to do that
>>   with XDP (and without sk_buff context). Providing it via function
>>   exports is possible of course, but that will only result in an
>>   "arms race" where we will see special-sauce functions all over the
>>   place -- DoS will always attempt to go for something that is
>>   difficult to filter against, cf. all the recent volume-based
>>   floodings.
Re: [flamebait] xdp, well meaning but pointless
On 12/01/16 at 10:11am, Florian Westphal wrote:
> Aside from this, XDP, like DPDK, is a kernel bypass.
> You might say 'It's just stack bypass, not a kernel bypass!'.
> But what does that mean exactly? That packets can still be passed
> onward to the normal stack?
> Bypass solutions like netmap can also inject packets back into the
> kernel stack again.

I have a fundamental issue with the approach of exporting packets into
user space and reinjecting them: Once the packet leaves the kernel, any
security guarantees are off. I have no control over what is running in
user space and whether whatever listener up there has been compromised
or not. To me, that's a no go, in particular for servers hosting multi
tenant workloads. This is one of the main reasons why XDP, in
particular in combination with BPF, is very interesting to me.

> b) with regards to a programmable data path: IFF one wants to do this
>    in kernel (and that's a big if), it seems much more preferable to
>    provide a config/data-based approach rather than a programmable one.
>    If you want full freedom, DPDK is architecturally just too powerful
>    to compete with.

I must have missed the legal disclaimer that is usually put in front of
the DPDK marketing show :-)

I don't want full freedom. I want programmability with stack
integration at sufficient speed and the ability to benefit from the
hardware abstractions that the kernel provides.

> Proponents of XDP sometimes provide usage examples.
> Let's look at some of these.

[ I won't comment on any of the other use cases because they are of no
  interest to me ]

> * Load balancer
>   State holding algorithms need sorting and searching, so also no fit
>   for eBPF (could be exposed by function exports, but then can we do
>   DoS by finding worst case scenarios?).
>
>   Also again needs a way to forward the frame out via another
>   interface.
>   For cases where the packet gets sent out via the same interface it
>   would appear to be easier to use port mirroring in a switch and use
>   stochastic filtering on end nodes to determine which host should
>   take responsibility.
>
>   XDP plus: central authority over how distribution will work in case
>   nodes are added/removed from the pool. But then again, it will be
>   easier to handle this with netmap/dpdk where more complicated
>   scheduling algorithms can be used.

I agree with you if the LB is a software based appliance in either a
dedicated VM or on dedicated baremetal.

The reality is turning out to be different in many cases though: LB
needs to be performed not only north-south but east-west as well. So
even if I handled LB for traffic entering my datacenter in user space,
I would still need the same LB for packets from my applications, and I
definitely don't want to move all of that into user space.

> * early drop/filtering.
>   While it's possible to do "u32"-like filters with ebpf, all modern
>   nics support ntuple filtering in hardware, which is going to be
>   faster because such a packet will never even be signalled to the
>   operating system. For more complicated cases (e.g. doing a socket
>   lookup to check if a particular packet matches a bound socket (and
>   expected sequence numbers etc.)) I don't see easy ways to do that
>   with XDP (and without sk_buff context). Providing it via function
>   exports is possible of course, but that will only result in an
>   "arms race" where we will see special-sauce functions all over the
>   place -- DoS will always attempt to go for something that is
>   difficult to filter against, cf. all the recent volume-based
>   floodings.

You probably put this last because this was the most difficult to shoot
down ;-) The benefits of XDP for this use case are extremely obvious in
combination with local applications which need to be protected. ntuple
filters won't cut it: they are limited, and there is a cap on the rate
at which they can be reconfigured.
Any serious mitigation will require stateful filtering with at least minimal L7 matching abilities and this is exactly where XDP will excel.
Re: [flamebait] xdp, well meaning but pointless
On 01.12.2016 10:11, Florian Westphal wrote:
> [ As already mentioned in my reply to Tom, here is
>   the xdp flamebait/critique ]
>
> Lots of XDP related patches started to appear on netdev.
> I'd prefer if it would stop...

I discussed this with Florian and helped with the text. I want to
mention this to express my full support for this.

Thanks,
Hannes
[flamebait] xdp, well meaning but pointless
[ As already mentioned in my reply to Tom, here is
  the xdp flamebait/critique ]

Lots of XDP related patches started to appear on netdev.
I'd prefer if it would stop...

To me XDP combines all the disadvantages of stack bypass solutions like
dpdk with the disadvantages of kernel programming: a more limited
instruction set and toolchain. Unlike XDP, userspace bypass (dpdk et
al) allows use of any programming model or language you want (including
scripting languages), which makes things a lot easier, e.g. garbage
collection, debuggers vs. crash+vmcore+printk...

I have heard the argument that the restrictions that come with XDP are
great because they allow to 'limit what users can do'. Given that the
existence of DPDK/netmap/userspace bypass is a reality, this is a very
weak argument -- why would anyone pick XDP over a dpdk/netmap based
solution? XDP will always be less powerful and a lot more complicated,
especially considering that users of dpdk (or toolkits built on top of
it) are not kernel programmers, and userspace has more powerful ipc
(or storage) mechanisms.

Aside from this, XDP, like DPDK, is a kernel bypass. You might say
'It's just stack bypass, not a kernel bypass!'. But what does that mean
exactly? That packets can still be passed onward to the normal stack?
Bypass solutions like netmap can also inject packets back into the
kernel stack again.

Running less powerful user code in a restricted environment in the
kernel address space is certainly a worse idea than separating this
logic out to user space.

In light of DPDK's existence it makes a lot more sense to me to
provide:

a) a faster mmap based interface (possibly AF_PACKET based) that allows
   mapping the nic directly into userspace, detaching the tx/rx queues
   from the kernel.

   John Fastabend sent something like this last year as a proof of
   concept; iirc it was rejected because register space got exposed
   directly to userspace. I think we should re-consider merging netmap
   (or something conceptually close to its design).

b)
with regards to a programmable data path: IFF one wants to do this in
   kernel (and that's a big if), it seems much more preferable to
   provide a config/data-based approach rather than a programmable one.
   If you want full freedom, DPDK is architecturally just too powerful
   to compete with.

Proponents of XDP sometimes provide usage examples. Let's look at some
of these.

== Application development: ==

* DNS Server
  Data structures and algorithms need to be implemented in a mostly
  Turing-complete language, so eBPF cannot readily be used for that.
  At least it will be orders of magnitude harder than in userspace.

* TCP Endpoint
  TCP processing in eBPF is a bit out of the question, while userspace
  tcp stacks based on both netmap and dpdk already exist today.

== Forwarding dataplane: ==

* Router/Switch
  Routers and switches should actually adhere to standardized and
  specified protocols and thus don't need a lot of custom, specialized
  software. Still a lot more work compared to userspace offloads, where
  you can do things like allocating a 4GB array to perform nexthop
  lookup. Also needs the ability to perform tx on another interface.

* Load balancer
  State holding algorithms need sorting and searching, so also no fit
  for eBPF (could be exposed by function exports, but then can we do
  DoS by finding worst case scenarios?).

  Also again needs a way to forward the frame out via another
  interface.

  For cases where the packet gets sent out via the same interface it
  would appear to be easier to use port mirroring in a switch and use
  stochastic filtering on end nodes to determine which host should take
  responsibility.

  XDP plus: central authority over how distribution will work in case
  nodes are added/removed from the pool. But then again, it will be
  easier to handle this with netmap/dpdk where more complicated
  scheduling algorithms can be used.

* early drop/filtering.
  While it's possible to do "u32"-like filters with ebpf, all modern
  nics support ntuple filtering in hardware, which is going to be
  faster because such a packet will never even be signalled to the
  operating system. For more complicated cases (e.g. doing a socket
  lookup to check if a particular packet matches a bound socket (and
  expected sequence numbers etc.)) I don't see easy ways to do that
  with XDP (and without sk_buff context). Providing it via function
  exports is possible of course, but that will only result in an "arms
  race" where we will see special-sauce functions all over the place --
  DoS will always attempt to go for something that is difficult to
  filter against, cf. all the recent volume-based floodings.

Thanks,
Florian