Re: [flamebait] xdp, well meaning but pointless

2016-12-05 Thread Jesper Dangaard Brouer
On Sat, 3 Dec 2016 11:48:22 -0800
John Fastabend  wrote:

> On 16-12-03 08:19 AM, Willem de Bruijn wrote:
> > On Fri, Dec 2, 2016 at 12:22 PM, Jesper Dangaard Brouer
> >  wrote:  
> >>
> >> On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal  wrote:
> >>  
> >>> In light of DPDKs existence it make a lot more sense to me to provide
> >>> a). a faster mmap based interface (possibly AF_PACKET based) that allows
> >>> to map nic directly into userspace, detaching tx/rx queue from kernel.
> >>>
> >>> John Fastabend sent something like this last year as a proof of
> >>> concept, iirc it was rejected because register space got exposed directly
> >>> to userspace.  I think we should re-consider merging netmap
> >>> (or something conceptually close to its design).  
> >>
> >> I'm actually working in this direction, of zero-copy RX mapping packets
> >> into userspace.  This work is mostly related to page_pool, and I only
> >> plan to use XDP as a filter for selecting packets going to userspace,
> >> as this choice need to be taken very early.
> >>
> >> My design is here:
> >>  
> >> https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
> >>
> >> This is mostly about changing the memory model in the drivers, to allow
> >> for safely mapping pages to userspace.  (An efficient queue mechanism is
> >> not covered).  
> > 
> > Virtio virtqueues are used in various other locations in the stack.
> > With separate memory pools and send + completion descriptor rings,
> > signal moderation, careful avoidance of cacheline bouncing, etc. these
> > seem like a good opportunity for a TPACKET_V4 format.
> >   
> 
> FWIW. After we rejected exposing the register space to user space due to
> valid security issues we fell back to using VFIO which works nicely for
> mapping virtual functions into userspace and VMs. The main  drawback is
> user space has to manage the VF but that is mostly a solved problem at
> this point. Deployment concerns aside.

Using VFs (PCIe SR-IOV Virtual Functions) solves this in a completely
different orthogonal way.  To me it is still like taking over the entire
NIC, although you use HW to split the traffic into VFs.  Setup for VF
deployment still looks troubling like 1G hugepages and vfio
enable_unsafe_noiommu_mode=1.  And generally getting SR-IOV working on
your HW is a task of it's own.

One thing people often seem to miss with SR-IOV VFs is that VM-to-VM
traffic will be limited by PCIe bandwidth and transaction overheads.
Like Stepen Hemminger demonstrated[1] at NetDev 1.2 and Luigi also have
a paper demonstrating this (AFAICR).
[1] http://netdevconf.org/1.2/session.html?stephen-hemminger


A key difference in my design is to, allow the NIC to be shared in a
safe manor.  The NIC functions 100% as a normal Linux controlled NIC.
The catch is that once an application request zero-copy RX, then the
NIC might have to reconfigure it's RX-ring usage.  As the driver MUST
change into what I call the "read-only packet page" mode, which
actually is the default in many drivers today.


> There was a TPACKET_V4 version we had a prototype of that passed
> buffers down to the hardware to use with the dma engine. This gives
> zero-copy but same as VFs requires the hardware to do all the steering
> of traffic and any expected policy in front of the application. Due to
> requiring user space to kick hardware and vice versa though it was
> somewhat slower so I didn't finish it up. The kick was implemented as a
> syscall iirc. I can maybe look at it a bit more next week and see if its
> worth reviving now in this context.

This is still at the design stage.  The target here is that the
page_pool and driver adjustments will provide the basis for building RX
zero-copy solutions in a memory safe manor.

I do see tcpdump/RAW packet access like TPACKET_V4 being one of the
first users of this.  Not the only user, as further down the road, I
also imagine RX zero-copy delivery into sockets (and perhaps combined
with a "raw_demux" step that doesn't alloc the SKB, which Tom hinted in
the other thread for UDP delivery).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [flamebait] xdp, well meaning but pointless

2016-12-03 Thread John Fastabend
On 16-12-03 08:19 AM, Willem de Bruijn wrote:
> On Fri, Dec 2, 2016 at 12:22 PM, Jesper Dangaard Brouer
>  wrote:
>>
>> On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal  wrote:
>>
>>> In light of DPDKs existence it make a lot more sense to me to provide
>>> a). a faster mmap based interface (possibly AF_PACKET based) that allows
>>> to map nic directly into userspace, detaching tx/rx queue from kernel.
>>>
>>> John Fastabend sent something like this last year as a proof of
>>> concept, iirc it was rejected because register space got exposed directly
>>> to userspace.  I think we should re-consider merging netmap
>>> (or something conceptually close to its design).
>>
>> I'm actually working in this direction, of zero-copy RX mapping packets
>> into userspace.  This work is mostly related to page_pool, and I only
>> plan to use XDP as a filter for selecting packets going to userspace,
>> as this choice need to be taken very early.
>>
>> My design is here:
>>  
>> https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
>>
>> This is mostly about changing the memory model in the drivers, to allow
>> for safely mapping pages to userspace.  (An efficient queue mechanism is
>> not covered).
> 
> Virtio virtqueues are used in various other locations in the stack.
> With separate memory pools and send + completion descriptor rings,
> signal moderation, careful avoidance of cacheline bouncing, etc. these
> seem like a good opportunity for a TPACKET_V4 format.
> 

FWIW. After we rejected exposing the register space to user space due to
valid security issues we fell back to using VFIO which works nicely for
mapping virtual functions into userspace and VMs. The main  drawback is
user space has to manage the VF but that is mostly a solved problem at
this point. Deployment concerns aside.

There was a TPACKET_V4 version we had a prototype of that passed
buffers down to the hardware to use with the dma engine. This gives
zero-copy but same as VFs requires the hardware to do all the steering
of traffic and any expected policy in front of the application. Due to
requiring user space to kick hardware and vice versa though it was
somewhat slower so I didn't finish it up. The kick was implemented as a
syscall iirc. I can maybe look at it a bit more next week and see if its
worth reviving now in this context.

I don't think any of this requires page pools though. Or rather tpacket
and vhost/virtio already know how to do page pools is perhaps the other
way to look at it.

One idea I've been playing around with is a vhost backend using
tpacketv{3|4} so we don't require socket manipulation.

Thanks,
John


Re: [flamebait] xdp, well meaning but pointless

2016-12-03 Thread Willem de Bruijn
On Fri, Dec 2, 2016 at 12:22 PM, Jesper Dangaard Brouer
 wrote:
>
> On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal  wrote:
>
>> In light of DPDKs existence it make a lot more sense to me to provide
>> a). a faster mmap based interface (possibly AF_PACKET based) that allows
>> to map nic directly into userspace, detaching tx/rx queue from kernel.
>>
>> John Fastabend sent something like this last year as a proof of
>> concept, iirc it was rejected because register space got exposed directly
>> to userspace.  I think we should re-consider merging netmap
>> (or something conceptually close to its design).
>
> I'm actually working in this direction, of zero-copy RX mapping packets
> into userspace.  This work is mostly related to page_pool, and I only
> plan to use XDP as a filter for selecting packets going to userspace,
> as this choice need to be taken very early.
>
> My design is here:
>  
> https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
>
> This is mostly about changing the memory model in the drivers, to allow
> for safely mapping pages to userspace.  (An efficient queue mechanism is
> not covered).

Virtio virtqueues are used in various other locations in the stack.
With separate memory pools and send + completion descriptor rings,
signal moderation, careful avoidance of cacheline bouncing, etc. these
seem like a good opportunity for a TPACKET_V4 format.


Re: [flamebait] xdp, well meaning but pointless

2016-12-02 Thread Tom Herbert
On Fri, Dec 2, 2016 at 11:56 AM, Stephen Hemminger
 wrote:
> On Fri, 2 Dec 2016 19:12:00 +0100
> Hannes Frederic Sowa  wrote:
>
>> On 02.12.2016 17:59, Tom Herbert wrote:
>> > On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa
>> >  wrote:
>> >> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
>> >>> On Thu, 1 Dec 2016 13:51:32 -0800
>> >>> Tom Herbert  wrote:
>> >>>
>> >> The technical plenary at last IETF on Seoul a couple of weeks ago was
>> >> exclusively focussed on DDOS in light of the recent attack against
>> >> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
>> >> presentation by Nick Sullivan
>> >> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
>> >> alluded to some implementation of DDOS mitigation. In particular, on
>> >> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>> >>>
>> >>> slide 14
>> >>>
>> >> numbers he gave we're based in iptables+BPF and that was a whole
>> >> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
>> >> and that's also when I introduced XDP to whole IETF :-) ). If that's
>> >> the best we can do the Internet is in a world hurt. DDOS mitigation
>> >> alone is probably a sufficient motivation to look at XDP. We need
>> >> something that drops bad packets as quickly as possible when under
>> >> attack, we need this to be integrated into the stack, we need it to be
>> >> programmable to deal with the increasing savvy of attackers, and we
>> >> don't want to be forced to be dependent on HW solutions. This is why
>> >> we created XDP!
>> >>>
>> >>> The 1.2Mpps number is a bit low, but we are unfortunately in that
>> >>> ballpark.
>> >>>
>> > I totally understand that. But in my reply to David in this thread I
>> > mentioned DNS apex processing as being problematic which is actually
>> > being referred in your linked slide deck on page 9 ("What do floods 
>> > look
>> > like") and the problematic of parsing DNS packets in XDP due to string
>> > processing and looping inside eBPF.
>> >>>
>> >>> That is a weak argument. You do realize CloudFlare actually use eBPF to
>> >>> do this exact filtering, and (so-far) eBPF for parsing DNS have been
>> >>> sufficient for them.
>> >>
>> >> You are talking about this code on the following slides (I actually
>> >> transcribed it for you here and disassembled):
>> >>
>> >> l0: ld #0x14
>> >> l1: ldxb 4*([0]&0xf)
>> >> l2: add x
>> >> l3: tax
>> >> l4: ld [x+0]
>> >> l5: jeq #0x7657861, l6, l13
>> >> l6: ld [x+4]
>> >> l7: jeq #0x6d706c65, l8, l13
>> >> l8: ld [x+8]
>> >> l9: jeq #0x3636f6d, l10, l13
>> >> l10:ldb [x+12]
>> >> l11:jeq #0, l12, l13
>> >> l12:ret #0x1
>> >> l13:ret #0
>> >>
>> >> You can offload this to u32 in hardware if that is what you want.
>> >>
>> >> The reason this works is because of netfilter, which allows them to
>> >> dynamically generate BPF programs and insert and delete them from
>> >> chains, do intersection or unions of them.
>> >>
>> >> If you have a freestanding program like in XDP the complexity space is a
>> >> different one and not comparable to this at all.
>> >>
>> > I don't understand this comment about complexity especially in regards
>> > to the idea of offloading u32 to hardware. Relying on hardware to do
>> > anything always leads to more complexity than an equivalent SW
>> > implementation for the same functionality. The only reason we ever use
>> > a hardware mechanisms is if it gives *significantly* better
>> > performance. If the performance difference isn't there then doing
>> > things in SW is going to be the better path (as we see in XDP).
>>
>> I am just wondering why the u32 filter wasn't mentioned in their slide
>> deck. If all what Cloudflare needs are those kind of matches, they are
>> in fact actually easier to generate than an cBPF program. It is not a
>> good example of how a real world DoS filter in XDP would look like.
>>
>> If you argue XDP as a C function hook that can call arbitrary code in
>> the driver before submitting that to the networking stack, yep, that is
>> not complex at all. Depending on how those modules will be maintained,
>> they either end up in the kernel and will be updated on major changes or
>> are 3rd party and people have to update them and also depend on the
>> driver features.
>>
>> But this opens up a whole new can of worms also. I haven't really
>> thought this through completely, but last time the patches were nack'ed
>> with lots of strong opinions and I tended to agree with them. I am
>> revisiting this position.
>>
>> Certainly you can build real-world DoS protection with this function
>> pointer hook and C code in the driver. In this 

Re: [flamebait] xdp, well meaning but pointless

2016-12-02 Thread Stephen Hemminger
On Fri, 2 Dec 2016 19:12:00 +0100
Hannes Frederic Sowa  wrote:

> On 02.12.2016 17:59, Tom Herbert wrote:
> > On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa
> >  wrote:  
> >> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:  
> >>> On Thu, 1 Dec 2016 13:51:32 -0800
> >>> Tom Herbert  wrote:
> >>>  
> >> The technical plenary at last IETF on Seoul a couple of weeks ago was
> >> exclusively focussed on DDOS in light of the recent attack against
> >> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
> >> presentation by Nick Sullivan
> >> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
> >> alluded to some implementation of DDOS mitigation. In particular, on
> >> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"  
> >>>
> >>> slide 14
> >>>  
> >> numbers he gave we're based in iptables+BPF and that was a whole
> >> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
> >> and that's also when I introduced XDP to whole IETF :-) ). If that's
> >> the best we can do the Internet is in a world hurt. DDOS mitigation
> >> alone is probably a sufficient motivation to look at XDP. We need
> >> something that drops bad packets as quickly as possible when under
> >> attack, we need this to be integrated into the stack, we need it to be
> >> programmable to deal with the increasing savvy of attackers, and we
> >> don't want to be forced to be dependent on HW solutions. This is why
> >> we created XDP!  
> >>>
> >>> The 1.2Mpps number is a bit low, but we are unfortunately in that
> >>> ballpark.
> >>>  
> > I totally understand that. But in my reply to David in this thread I
> > mentioned DNS apex processing as being problematic which is actually
> > being referred in your linked slide deck on page 9 ("What do floods look
> > like") and the problematic of parsing DNS packets in XDP due to string
> > processing and looping inside eBPF.  
> >>>
> >>> That is a weak argument. You do realize CloudFlare actually use eBPF to
> >>> do this exact filtering, and (so-far) eBPF for parsing DNS have been
> >>> sufficient for them.  
> >>
> >> You are talking about this code on the following slides (I actually
> >> transcribed it for you here and disassembled):
> >>
> >> l0: ld #0x14
> >> l1: ldxb 4*([0]&0xf)
> >> l2: add x
> >> l3: tax
> >> l4: ld [x+0]
> >> l5: jeq #0x7657861, l6, l13
> >> l6: ld [x+4]
> >> l7: jeq #0x6d706c65, l8, l13
> >> l8: ld [x+8]
> >> l9: jeq #0x3636f6d, l10, l13
> >> l10:ldb [x+12]
> >> l11:jeq #0, l12, l13
> >> l12:ret #0x1
> >> l13:ret #0
> >>
> >> You can offload this to u32 in hardware if that is what you want.
> >>
> >> The reason this works is because of netfilter, which allows them to
> >> dynamically generate BPF programs and insert and delete them from
> >> chains, do intersection or unions of them.
> >>
> >> If you have a freestanding program like in XDP the complexity space is a
> >> different one and not comparable to this at all.
> >>  
> > I don't understand this comment about complexity especially in regards
> > to the idea of offloading u32 to hardware. Relying on hardware to do
> > anything always leads to more complexity than an equivalent SW
> > implementation for the same functionality. The only reason we ever use
> > a hardware mechanisms is if it gives *significantly* better
> > performance. If the performance difference isn't there then doing
> > things in SW is going to be the better path (as we see in XDP).  
> 
> I am just wondering why the u32 filter wasn't mentioned in their slide
> deck. If all what Cloudflare needs are those kind of matches, they are
> in fact actually easier to generate than an cBPF program. It is not a
> good example of how a real world DoS filter in XDP would look like.
> 
> If you argue XDP as a C function hook that can call arbitrary code in
> the driver before submitting that to the networking stack, yep, that is
> not complex at all. Depending on how those modules will be maintained,
> they either end up in the kernel and will be updated on major changes or
> are 3rd party and people have to update them and also depend on the
> driver features.
> 
> But this opens up a whole new can of worms also. I haven't really
> thought this through completely, but last time the patches were nack'ed
> with lots of strong opinions and I tended to agree with them. I am
> revisiting this position.
> 
> Certainly you can build real-world DoS protection with this function
> pointer hook and C code in the driver. In this case a user space
> solution still has advantages because of maintainability, as e.g. with
> netmap or dpdk you are again decoupled from the in-kernel API/ABI and
> don't 

Re: [flamebait] xdp, well meaning but pointless

2016-12-02 Thread Hannes Frederic Sowa
On 02.12.2016 17:59, Tom Herbert wrote:
> On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa
>  wrote:
>> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
>>> On Thu, 1 Dec 2016 13:51:32 -0800
>>> Tom Herbert  wrote:
>>>
>> The technical plenary at last IETF on Seoul a couple of weeks ago was
>> exclusively focussed on DDOS in light of the recent attack against
>> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
>> presentation by Nick Sullivan
>> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
>> alluded to some implementation of DDOS mitigation. In particular, on
>> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>>>
>>> slide 14
>>>
>> numbers he gave we're based in iptables+BPF and that was a whole
>> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
>> and that's also when I introduced XDP to whole IETF :-) ). If that's
>> the best we can do the Internet is in a world hurt. DDOS mitigation
>> alone is probably a sufficient motivation to look at XDP. We need
>> something that drops bad packets as quickly as possible when under
>> attack, we need this to be integrated into the stack, we need it to be
>> programmable to deal with the increasing savvy of attackers, and we
>> don't want to be forced to be dependent on HW solutions. This is why
>> we created XDP!
>>>
>>> The 1.2Mpps number is a bit low, but we are unfortunately in that
>>> ballpark.
>>>
> I totally understand that. But in my reply to David in this thread I
> mentioned DNS apex processing as being problematic which is actually
> being referred in your linked slide deck on page 9 ("What do floods look
> like") and the problematic of parsing DNS packets in XDP due to string
> processing and looping inside eBPF.
>>>
>>> That is a weak argument. You do realize CloudFlare actually use eBPF to
>>> do this exact filtering, and (so-far) eBPF for parsing DNS have been
>>> sufficient for them.
>>
>> You are talking about this code on the following slides (I actually
>> transcribed it for you here and disassembled):
>>
>> l0: ld #0x14
>> l1: ldxb 4*([0]&0xf)
>> l2: add x
>> l3: tax
>> l4: ld [x+0]
>> l5: jeq #0x7657861, l6, l13
>> l6: ld [x+4]
>> l7: jeq #0x6d706c65, l8, l13
>> l8: ld [x+8]
>> l9: jeq #0x3636f6d, l10, l13
>> l10:ldb [x+12]
>> l11:jeq #0, l12, l13
>> l12:ret #0x1
>> l13:ret #0
>>
>> You can offload this to u32 in hardware if that is what you want.
>>
>> The reason this works is because of netfilter, which allows them to
>> dynamically generate BPF programs and insert and delete them from
>> chains, do intersection or unions of them.
>>
>> If you have a freestanding program like in XDP the complexity space is a
>> different one and not comparable to this at all.
>>
> I don't understand this comment about complexity especially in regards
> to the idea of offloading u32 to hardware. Relying on hardware to do
> anything always leads to more complexity than an equivalent SW
> implementation for the same functionality. The only reason we ever use
> a hardware mechanisms is if it gives *significantly* better
> performance. If the performance difference isn't there then doing
> things in SW is going to be the better path (as we see in XDP).

I am just wondering why the u32 filter wasn't mentioned in their slide
deck. If all what Cloudflare needs are those kind of matches, they are
in fact actually easier to generate than an cBPF program. It is not a
good example of how a real world DoS filter in XDP would look like.

If you argue XDP as a C function hook that can call arbitrary code in
the driver before submitting that to the networking stack, yep, that is
not complex at all. Depending on how those modules will be maintained,
they either end up in the kernel and will be updated on major changes or
are 3rd party and people have to update them and also depend on the
driver features.

But this opens up a whole new can of worms also. I haven't really
thought this through completely, but last time the patches were nack'ed
with lots of strong opinions and I tended to agree with them. I am
revisiting this position.

Certainly you can build real-world DoS protection with this function
pointer hook and C code in the driver. In this case a user space
solution still has advantages because of maintainability, as e.g. with
netmap or dpdk you are again decoupled from the in-kernel API/ABI and
don't need to test, recompile etc. on each kernel upgrade. If the module
ends up in the kernel, those problems might also disappear.

For XDP+eBPF to provide a full DoS mitigation (protocol parsing,
sampling and dropping) solution seems to be too complex for me because
of the arguments I stated in my previous 

Re: [flamebait] xdp, well meaning but pointless

2016-12-02 Thread Jesper Dangaard Brouer

On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal  wrote:

> In light of DPDKs existence it make a lot more sense to me to provide
> a). a faster mmap based interface (possibly AF_PACKET based) that allows
> to map nic directly into userspace, detaching tx/rx queue from kernel.
> 
> John Fastabend sent something like this last year as a proof of
> concept, iirc it was rejected because register space got exposed directly
> to userspace.  I think we should re-consider merging netmap
> (or something conceptually close to its design).

I'm actually working in this direction, of zero-copy RX mapping packets
into userspace.  This work is mostly related to page_pool, and I only
plan to use XDP as a filter for selecting packets going to userspace,
as this choice need to be taken very early.

My design is here:
 
https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

This is mostly about changing the memory model in the drivers, to allow
for safely mapping pages to userspace.  (An efficient queue mechanism is
not covered).  People often overlook that netmap's efficiency *also* comes
from introducing pre-mapping memory/pages to userspace.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [flamebait] xdp, well meaning but pointless

2016-12-02 Thread Tom Herbert
On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa
 wrote:
> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
>> On Thu, 1 Dec 2016 13:51:32 -0800
>> Tom Herbert  wrote:
>>
> The technical plenary at last IETF on Seoul a couple of weeks ago was
> exclusively focussed on DDOS in light of the recent attack against
> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
> presentation by Nick Sullivan
> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
> alluded to some implementation of DDOS mitigation. In particular, on
> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>>
>> slide 14
>>
> numbers he gave we're based in iptables+BPF and that was a whole
> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
> and that's also when I introduced XDP to whole IETF :-) ). If that's
> the best we can do the Internet is in a world hurt. DDOS mitigation
> alone is probably a sufficient motivation to look at XDP. We need
> something that drops bad packets as quickly as possible when under
> attack, we need this to be integrated into the stack, we need it to be
> programmable to deal with the increasing savvy of attackers, and we
> don't want to be forced to be dependent on HW solutions. This is why
> we created XDP!
>>
>> The 1.2Mpps number is a bit low, but we are unfortunately in that
>> ballpark.
>>
 I totally understand that. But in my reply to David in this thread I
 mentioned DNS apex processing as being problematic which is actually
 being referred in your linked slide deck on page 9 ("What do floods look
 like") and the problematic of parsing DNS packets in XDP due to string
 processing and looping inside eBPF.
>>
>> That is a weak argument. You do realize CloudFlare actually use eBPF to
>> do this exact filtering, and (so-far) eBPF for parsing DNS have been
>> sufficient for them.
>
> You are talking about this code on the following slides (I actually
> transcribed it for you here and disassembled):
>
> l0: ld #0x14
> l1: ldxb 4*([0]&0xf)
> l2: add x
> l3: tax
> l4: ld [x+0]
> l5: jeq #0x7657861, l6, l13
> l6: ld [x+4]
> l7: jeq #0x6d706c65, l8, l13
> l8: ld [x+8]
> l9: jeq #0x3636f6d, l10, l13
> l10:ldb [x+12]
> l11:jeq #0, l12, l13
> l12:ret #0x1
> l13:ret #0
>
> You can offload this to u32 in hardware if that is what you want.
>
> The reason this works is because of netfilter, which allows them to
> dynamically generate BPF programs and insert and delete them from
> chains, do intersection or unions of them.
>
> If you have a freestanding program like in XDP the complexity space is a
> different one and not comparable to this at all.
>
I don't understand this comment about complexity especially in regards
to the idea of offloading u32 to hardware. Relying on hardware to do
anything always leads to more complexity than an equivalent SW
implementation for the same functionality. The only reason we ever use
a hardware mechanisms is if it gives *significantly* better
performance. If the performance difference isn't there then doing
things in SW is going to be the better path (as we see in XDP).

Tom

> Bye,
> Hannes
>


Re: [flamebait] xdp, well meaning but pointless

2016-12-02 Thread Hannes Frederic Sowa
On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
> On Thu, 1 Dec 2016 13:51:32 -0800
> Tom Herbert  wrote:
> 
 The technical plenary at last IETF on Seoul a couple of weeks ago was
 exclusively focussed on DDOS in light of the recent attack against
 Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
 presentation by Nick Sullivan
 (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
 alluded to some implementation of DDOS mitigation. In particular, on
 slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
> 
> slide 14
> 
 numbers he gave we're based in iptables+BPF and that was a whole
 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
 and that's also when I introduced XDP to whole IETF :-) ). If that's
 the best we can do the Internet is in a world hurt. DDOS mitigation
 alone is probably a sufficient motivation to look at XDP. We need
 something that drops bad packets as quickly as possible when under
 attack, we need this to be integrated into the stack, we need it to be
 programmable to deal with the increasing savvy of attackers, and we
 don't want to be forced to be dependent on HW solutions. This is why
 we created XDP!  
> 
> The 1.2Mpps number is a bit low, but we are unfortunately in that
> ballpark.
> 
>>> I totally understand that. But in my reply to David in this thread I
>>> mentioned DNS apex processing as being problematic which is actually
>>> being referred in your linked slide deck on page 9 ("What do floods look
>>> like") and the problematic of parsing DNS packets in XDP due to string
>>> processing and looping inside eBPF.
> 
> That is a weak argument. You do realize CloudFlare actually use eBPF to
> do this exact filtering, and (so-far) eBPF for parsing DNS have been
> sufficient for them.

You are talking about this code on the following slides (I actually
transcribed it for you here and disassembled):

l0: ld #0x14
l1: ldxb 4*([0]&0xf)
l2: add x
l3: tax
l4: ld [x+0]
l5: jeq #0x7657861, l6, l13
l6: ld [x+4]
l7: jeq #0x6d706c65, l8, l13
l8: ld [x+8]
l9: jeq #0x3636f6d, l10, l13
l10:ldb [x+12]
l11:jeq #0, l12, l13
l12:ret #0x1
l13:ret #0

You can offload this to u32 in hardware if that is what you want.

The reason this works is because of netfilter, which allows them to
dynamically generate BPF programs and insert and delete them from
chains, do intersection or unions of them.

If you have a freestanding program like in XDP the complexity space is a
different one and not comparable to this at all.

Bye,
Hannes



Re: [flamebait] xdp, well meaning but pointless

2016-12-02 Thread Jesper Dangaard Brouer
On Thu, 1 Dec 2016 13:51:32 -0800
Tom Herbert  wrote:

> >> The technical plenary at last IETF on Seoul a couple of weeks ago was
> >> exclusively focussed on DDOS in light of the recent attack against
> >> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
> >> presentation by Nick Sullivan
> >> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
> >> alluded to some implementation of DDOS mitigation. In particular, on
> >> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"

slide 14

> >> numbers he gave we're based in iptables+BPF and that was a whole
> >> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
> >> and that's also when I introduced XDP to whole IETF :-) ). If that's
> >> the best we can do the Internet is in a world hurt. DDOS mitigation
> >> alone is probably a sufficient motivation to look at XDP. We need
> >> something that drops bad packets as quickly as possible when under
> >> attack, we need this to be integrated into the stack, we need it to be
> >> programmable to deal with the increasing savvy of attackers, and we
> >> don't want to be forced to be dependent on HW solutions. This is why
> >> we created XDP!  

The 1.2Mpps number is a bit low, but we are unfortunately in that
ballpark.

> > I totally understand that. But in my reply to David in this thread I
> > mentioned DNS apex processing as being problematic which is actually
> > being referred in your linked slide deck on page 9 ("What do floods look
> > like") and the problematic of parsing DNS packets in XDP due to string
> > processing and looping inside eBPF.

That is a weak argument. You do realize CloudFlare actually use eBPF to
do this exact filtering, and (so-far) eBPF for parsing DNS have been
sufficient for them.

> I agree that eBPF is not going to be sufficient from everything we'll
> want to do. Undoubtably, we'll continue see new addition of more
> helpers to assist in processing, but at some point we will want a to
> load a kernel module that handles more complex processing and insert
> it at the XDP callout. Nothing in the design of XDP precludes doing
> that and I have already posted the patches to generalize the XDP
> callout for that. Taking either of these routes has tradeoffs, but
> regardless of whether this is BPF or module code, the principles of
> XDP and its value to help solve some class of problems remains.

As I've said before, I do support Tom's patches for a more generic XDP
hook that the kernel itself can use.  The first thing I would implement
with this is a fast-path for Linux L2 bridging (do depend on multiport
TX support). It would be so easy to speedup bridging, XDP would only
need to forward packets already in the bridge-FIB table, rest is
XDP_PASS to normal stack and bridge code (timers etc).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Tom Herbert
On Thu, Dec 1, 2016 at 1:27 PM, Hannes Frederic Sowa
 wrote:
> On 01.12.2016 22:12, Tom Herbert wrote:
>> On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa
>>  wrote:
>>> Hello,
>>>
>>> this is a good conversation and I simply want to bring my worries
>>> across. I don't have good solutions for the problems XDP tries to solve
>>> but I fear we could get caught up in maintenance problems in the long
>>> term given the ideas floating around on how to evolve XDP currently.
>>>
>>> On 01.12.2016 17:28, Thomas Graf wrote:
 On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
> XDP manipulates packets at free will and thus all security guarantees
> are off as well as in any user space solution.
>
> Secondly user space provides policy, acl, more controlled memory
> protection, restartability and better debugability. If I had multi
> tenant workloads I would definitely put more complex "business/acl"
> logic into user space, so I can make use of LSM and other features to
> especially prevent a network facing service to attack the tenants. If
> stuff gets put into the kernel you run user controlled code in the
> kernel exposing a much bigger attack vector.
>
> What use case do you see in XDP specifically e.g. for container 
> networking?

 DDOS mitigation to protect distributed applications in large clusters.
 Relying on CDN works to protect API gateways and frontends (as long as
 they don't throw you out of their network) but offers no protection
 beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
 level and allowing the mitigation capability to scale up with the number
 of servers is natural and cheap.
>>>
>>> So far we e.g. always considered L2 attacks a problem of the network
>>> admin to correctly protect the environment. Are you talking about
>>> protecting the L3 data plane? Are there custom proprietary protocols in
>>> place which need custom protocol parsers that need involvement of the
>>> kernel before it could verify the packet?
>>>
>>> In the past we tried to protect the L3 data plane as good as we can in
>>> Linux to allow the plain old server admin to set an IP address on an
>>> interface and install whatever software in user space. We try not only
>>> to protect it but also try to achieve fairness by adding a lot of
>>> counters everywhere. Are protections missing right now or are we talking
>>> about better performance?
>>>
>> The technical plenary at last IETF on Seoul a couple of weeks ago was
>> exclusively focussed on DDOS in light of the recent attack against
>> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
>> presentation by Nick Sullivan
>> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
>> alluded to some implementation of DDOS mitigation. In particular, on
>> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>> numbers he gave we're based in iptables+BPF and that was a whole
>> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
>> and that's also when I introduced XDP to whole IETF :-) ). If that's
>> the best we can do the Internet is in a world hurt. DDOS mitigation
>> alone is probably a sufficient motivation to look at XDP. We need
>> something that drops bad packets as quickly as possible when under
>> attack, we need this to be integrated into the stack, we need it to be
>> programmable to deal with the increasing savvy of attackers, and we
>> don't want to be forced to be dependent on HW solutions. This is why
>> we created XDP!
>
> I totally understand that. But in my reply to David in this thread I
> mentioned DNS apex processing as being problematic which is actually
> being referred in your linked slide deck on page 9 ("What do floods look
> like") and the problematic of parsing DNS packets in XDP due to string
> processing and looping inside eBPF.
>
I agree that eBPF is not going to be sufficient from everything we'll
want to do. Undoubtably, we'll continue see new addition of more
helpers to assist in processing, but at some point we will want a to
load a kernel module that handles more complex processing and insert
it at the XDP callout. Nothing in the design of XDP precludes doing
that and I have already posted the patches to generalize the XDP
callout for that. Taking either of these routes has tradeoffs, but
regardless of whether this is BPF or module code, the principles of
XDP and its value to help solve some class of problems remains.

 Tom

> Not to mention the fact that you might have to deal with fragments in
> the Internet. Some DOS mitigations were already abused to generate
> blackholes for other users. Filtering such stuff is quite complicated.
>
> I argued also 

Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Hannes Frederic Sowa
On 01.12.2016 22:12, Tom Herbert wrote:
> On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa
>  wrote:
>> Hello,
>>
>> this is a good conversation and I simply want to bring my worries
>> across. I don't have good solutions for the problems XDP tries to solve
>> but I fear we could get caught up in maintenance problems in the long
>> term given the ideas floating around on how to evolve XDP currently.
>>
>> On 01.12.2016 17:28, Thomas Graf wrote:
>>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
 First of all, this is a rant targeted at XDP and not at eBPF as a whole.
 XDP manipulates packets at free will and thus all security guarantees
 are off as well as in any user space solution.

 Secondly user space provides policy, acl, more controlled memory
 protection, restartability and better debugability. If I had multi
 tenant workloads I would definitely put more complex "business/acl"
 logic into user space, so I can make use of LSM and other features to
 especially prevent a network facing service to attack the tenants. If
 stuff gets put into the kernel you run user controlled code in the
 kernel exposing a much bigger attack vector.

 What use case do you see in XDP specifically e.g. for container networking?
>>>
>>> DDOS mitigation to protect distributed applications in large clusters.
>>> Relying on CDN works to protect API gateways and frontends (as long as
>>> they don't throw you out of their network) but offers no protection
>>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
>>> level and allowing the mitigation capability to scale up with the number
>>> of servers is natural and cheap.
>>
>> So far we e.g. always considered L2 attacks a problem of the network
>> admin to correctly protect the environment. Are you talking about
>> protecting the L3 data plane? Are there custom proprietary protocols in
>> place which need custom protocol parsers that need involvement of the
>> kernel before it could verify the packet?
>>
>> In the past we tried to protect the L3 data plane as good as we can in
>> Linux to allow the plain old server admin to set an IP address on an
>> interface and install whatever software in user space. We try not only
>> to protect it but also try to achieve fairness by adding a lot of
>> counters everywhere. Are protections missing right now or are we talking
>> about better performance?
>>
> The technical plenary at last IETF on Seoul a couple of weeks ago was
> exclusively focussed on DDOS in light of the recent attack against
> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
> presentation by Nick Sullivan
> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
> alluded to some implementation of DDOS mitigation. In particular, on
> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
> numbers he gave we're based in iptables+BPF and that was a whole
> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
> and that's also when I introduced XDP to whole IETF :-) ). If that's
> the best we can do the Internet is in a world hurt. DDOS mitigation
> alone is probably a sufficient motivation to look at XDP. We need
> something that drops bad packets as quickly as possible when under
> attack, we need this to be integrated into the stack, we need it to be
> programmable to deal with the increasing savvy of attackers, and we
> don't want to be forced to be dependent on HW solutions. This is why
> we created XDP!

I totally understand that. But in my reply to David in this thread I
mentioned DNS apex processing as being problematic which is actually
being referred in your linked slide deck on page 9 ("What do floods look
like") and the problematic of parsing DNS packets in XDP due to string
processing and looping inside eBPF.

Not to mention the fact that you might have to deal with fragments in
the Internet. Some DOS mitigations were already abused to generate
blackholes for other users. Filtering such stuff is quite complicated.

I argued also under the aspect of what Thomas said, that the outside
world of the cluster is already protected by a CDN.

Bye,
Hannes



Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Tom Herbert
On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa
 wrote:
> Hello,
>
> this is a good conversation and I simply want to bring my worries
> across. I don't have good solutions for the problems XDP tries to solve
> but I fear we could get caught up in maintenance problems in the long
> term given the ideas floating around on how to evolve XDP currently.
>
> On 01.12.2016 17:28, Thomas Graf wrote:
>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>>> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
>>> XDP manipulates packets at free will and thus all security guarantees
>>> are off as well as in any user space solution.
>>>
>>> Secondly user space provides policy, acl, more controlled memory
>>> protection, restartability and better debugability. If I had multi
>>> tenant workloads I would definitely put more complex "business/acl"
>>> logic into user space, so I can make use of LSM and other features to
>>> especially prevent a network facing service to attack the tenants. If
>>> stuff gets put into the kernel you run user controlled code in the
>>> kernel exposing a much bigger attack vector.
>>>
>>> What use case do you see in XDP specifically e.g. for container networking?
>>
>> DDOS mitigation to protect distributed applications in large clusters.
>> Relying on CDN works to protect API gateways and frontends (as long as
>> they don't throw you out of their network) but offers no protection
>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
>> level and allowing the mitigation capability to scale up with the number
>> of servers is natural and cheap.
>
> So far we e.g. always considered L2 attacks a problem of the network
> admin to correctly protect the environment. Are you talking about
> protecting the L3 data plane? Are there custom proprietary protocols in
> place which need custom protocol parsers that need involvement of the
> kernel before it could verify the packet?
>
> In the past we tried to protect the L3 data plane as good as we can in
> Linux to allow the plain old server admin to set an IP address on an
> interface and install whatever software in user space. We try not only
> to protect it but also try to achieve fairness by adding a lot of
> counters everywhere. Are protections missing right now or are we talking
> about better performance?
>
The technical plenary at last IETF on Seoul a couple of weeks ago was
exclusively focussed on DDOS in light of the recent attack against
Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
presentation by Nick Sullivan
(https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
alluded to some implementation of DDOS mitigation. In particular, on
slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
numbers he gave we're based in iptables+BPF and that was a whole
1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
and that's also when I introduced XDP to whole IETF :-) ). If that's
the best we can do the Internet is in a world hurt. DDOS mitigation
alone is probably a sufficient motivation to look at XDP. We need
something that drops bad packets as quickly as possible when under
attack, we need this to be integrated into the stack, we need it to be
programmable to deal with the increasing savvy of attackers, and we
don't want to be forced to be dependent on HW solutions. This is why
we created XDP!

Tom

> To provide fairness you often have to share validated data within the
> kernel and with XDP. This requires consistent lookup methods for sockets
> in the lower level. Those can be exported to XDP via external functions
> and become part of uAPI which will limit our ability to change those
> functions in future. When the discussion started about early demuxing in
> XDP I became really nervous, because suddenly the XDP program has to
> decide correctly which protocol type it has and look in the correct
> socket table for the socket. Different semantics for sockets can apply
> here, e.g. some sockets are RCU managed, some end up using reference
> counts. A wrong decision here would cause havoc in the kernel (XDP
> considers packet as UDP but kernel stack as TCP). Also, who knows that
> we won't have per-cpu socket tables we would keep that as uAPI (this is
> btw. the dragonflyBSD approach to scaling)? Imagine someone writing a
> SIP rewriter in XDP and depending on a coherent view of all sockets even
> if their hash doesn't fit to the one of the queue? Suddenly something
> which was thought of as being only mutable by one CPU becomes global
> again and because of XDP we need to add locking because of uAPI.
>
> This discussion is parallel to the discussion about trace points, which
> are not considered uAPI. If eBPF functions are not considered uAPI then
> eBPF in the network stack will have much less value, because you
> suddenly 

Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Hannes Frederic Sowa
Hello,

this is a good conversation and I simply want to bring my worries
across. I don't have good solutions for the problems XDP tries to solve
but I fear we could get caught up in maintenance problems in the long
term given the ideas floating around on how to evolve XDP currently.

On 01.12.2016 17:28, Thomas Graf wrote:
> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
>> XDP manipulates packets at free will and thus all security guarantees
>> are off as well as in any user space solution.
>>
>> Secondly user space provides policy, acl, more controlled memory
>> protection, restartability and better debugability. If I had multi
>> tenant workloads I would definitely put more complex "business/acl"
>> logic into user space, so I can make use of LSM and other features to
>> especially prevent a network facing service to attack the tenants. If
>> stuff gets put into the kernel you run user controlled code in the
>> kernel exposing a much bigger attack vector.
>>
>> What use case do you see in XDP specifically e.g. for container networking?
> 
> DDOS mitigation to protect distributed applications in large clusters.
> Relying on CDN works to protect API gateways and frontends (as long as
> they don't throw you out of their network) but offers no protection
> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
> level and allowing the mitigation capability to scale up with the number
> of servers is natural and cheap.

So far we e.g. always considered L2 attacks a problem of the network
admin to correctly protect the environment. Are you talking about
protecting the L3 data plane? Are there custom proprietary protocols in
place which need custom protocol parsers that need involvement of the
kernel before it could verify the packet?

In the past we tried to protect the L3 data plane as good as we can in
Linux to allow the plain old server admin to set an IP address on an
interface and install whatever software in user space. We try not only
to protect it but also try to achieve fairness by adding a lot of
counters everywhere. Are protections missing right now or are we talking
about better performance?

To provide fairness you often have to share validated data within the
kernel and with XDP. This requires consistent lookup methods for sockets
in the lower level. Those can be exported to XDP via external functions
and become part of uAPI which will limit our ability to change those
functions in future. When the discussion started about early demuxing in
XDP I became really nervous, because suddenly the XDP program has to
decide correctly which protocol type it has and look in the correct
socket table for the socket. Different semantics for sockets can apply
here, e.g. some sockets are RCU managed, some end up using reference
counts. A wrong decision here would cause havoc in the kernel (XDP
considers packet as UDP but kernel stack as TCP). Also, who knows that
we won't have per-cpu socket tables we would keep that as uAPI (this is
btw. the dragonflyBSD approach to scaling)? Imagine someone writing a
SIP rewriter in XDP and depending on a coherent view of all sockets even
if their hash doesn't fit to the one of the queue? Suddenly something
which was thought of as being only mutable by one CPU becomes global
again and because of XDP we need to add locking because of uAPI.

This discussion is parallel to the discussion about trace points, which
are not considered uAPI. If eBPF functions are not considered uAPI then
eBPF in the network stack will have much less value, because you
suddenly depend on specific kernel versions again and cannot simply load
the code into the kernel. The API checks will become very difficult to
implement, see also the ongoing MODVERSIONS discussions on LKML some
days back.

>>> I agree with you if the LB is a software based appliance in either a
>>> dedicated VM or on dedicated baremetal.
>>>
>>> The reality is turning out to be different in many cases though, LB
>>> needs to be performed not only for north south but east west as well.
>>> So even if I would handle LB for traffic entering my datacenter in user
>>> space, I will need the same LB for packets from my applications and
>>> I definitely don't want to move all of that into user space.
>>
>> The open question to me is why is programmability needed here.
>>
>> Look at the discussion about ECMP and consistent hashing. It is not very
>> easy to actually write this code correctly. Why can't we just put C code
>> into the kernel that implements this once and for all and let user space
>> update the policies?
> 
> Whatever LB logic is put in place with native C code now is unlikely the
> logic we need in two years. We can't really predict the future. If it
> was the case, networking would have been done long ago and we would all
> be working on self eating ice cream now.

Did LB algorithms on the networking layer change that much?


Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Tom Herbert
On Thu, Dec 1, 2016 at 10:01 AM, Tom Herbert  wrote:
>
>
> On Thu, Dec 1, 2016 at 1:11 AM, Florian Westphal  wrote:
>>
>> [ As already mentioned in my reply to Tom, here is
>> the xdp flamebait/critique ]
>>
>> Lots of XDP related patches started to appear on netdev.
>> I'd prefer if it would stop...
>>
>> To me XDP combines all disadvantages of stack bypass solutions like dpdk
>> with the disadvantages of kernel programming with a more limited
>> instruction set and toolchain.
>>
>> Unlike XDP userspace bypass (dpdk et al) allow use of any programming
>> model or language you want (including scripting languages), which
>> makes things a lot easier, e.g. garbage collection, debuggers vs.
>> crash+vmcore+printk...
>>
>> I have heared the argument that these restrictions that come with
>> XDP are great because it allows to 'limit what users can do'.
>>
>> Given existence of DPDK/netmap/userspace bypass is a reality, this is
>> a very weak argument -- why would anyone pick XDP over a dpdk/netmap
>> based solution?
>
Because, we've seen time an time again that attempts to bypass the
stack and run parallel stacks under the banner of "the kernel is too
slow" does not scale for large deployment. We've seen this with RDMA,
TOE, OpenOnload, and we'll see this for DPDK, FD.io, VPP and whatever
else people are going to dream up. If I have a couple hundred machines
running a single application like the HFT guys do, then sure I'd
probably look into such solutions. But when I have datacenters with
100Ks running an assortment of applications even contemplating the
possibility of deploying a parallel stacks gives me headache. We need
to consider an seemingly endless list of security issues,
manageability. robustness, protocol compatibility, etc. I really have
little interest in bringing a huge pile of 3rd party code that I have
to support, and I definitely have no interest in constantly replacing
all of my hardware to get the latest and greatest support for these
offloads as vendors leak them out. Given a choice between buying into
some kernel bypass solution versus hacking Linux a little bit to carve
out an accelerated data path to address the "kernel is too slow"
argument, I will choose the latter any day of the week.

Tom

>
> Because, we've seen time an time again that attempts to bypass the stack and
> run parallel stacks under the banner of "the kernel is too slow" does not
> scale for large deployment. We've seen this with RDMA, TOE, OpenOnload, and
> we'll see this for DPDK, FD.io, VPP and whatever else people are going to
> dream up. If I have a couple hundred machines running a single application
> like the HFT guys do, then sure I'd probably look into such solutions. But
> when I have datacenters with 100Ks running an assortment of applications
> even contemplating the possibility of deploying a parallel stacks gives me
> headache. We need to consider an seemingly endless list of security issues,
> manageability. robustness, protocol compatibility, etc. I really have little
> interest in bringing a huge pile of 3rd party code that I have to support,
> and I definitely have no interest in constantly replacing all of my hardware
> to get the latest and greatest support for these offloads as vendors leak
> them out. Given a choice between buying into some kernel bypass solution
> versus hacking Linux a little bit to carve out an accelerated data path to
> address the "kernel is too slow" argument, I will choose the latter any day
> of the week.
>
> Tom
>
>> XDP will always be less powerful and a lot more complicated,
>> especially considering users of dpdk (or toolkits built on top of it)
>> are not kernel programmers and userspace has more powerful ipc
>> (or storage) mechanisms.
>>
>> Aside from this, XDP, like DPDK, is a kernel bypass.
>> You might say 'Its just stack bypass, not a kernel bypass!'.
>> But what does that mean exactly?  That packets can still be passed
>> onward to normal stack?
>> Bypass solutions like netmap can also inject packets back to
>> kernel stack again.
>>
>> Running less powerful user code in a restricted environment in the kernel
>> address space is certainly a worse idea than separating this logic out
>> to user space.
>>
>> In light of DPDKs existence it make a lot more sense to me to provide
>> a). a faster mmap based interface (possibly AF_PACKET based) that allows
>> to map nic directly into userspace, detaching tx/rx queue from kernel.
>>
>> John Fastabend sent something like this last year as a proof of
>> concept, iirc it was rejected because register space got exposed directly
>> to userspace.  I think we should re-consider merging netmap
>> (or something conceptually close to its design).
>>
>> b). with regards to a programmable data path: IFF one wants to do this
>> in kernel (and thats a big if), it seems much more preferrable to provide
>> a config/data-based approach rather than a programmable one.  If you want
>> full freedom DPDK 

Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Hannes Frederic Sowa
On 01.12.2016 17:19, David Miller wrote:
> Saying that ntuple filters can handle the early drop use case doesn't
> take into consideration the nature of the tables (hundreds of
> thousands of "evil" IP addresses), whether hardware can actually
> handle that (it can't), and whether simple IP address matching is the
> full extent of it (it isn't).

Yes, that is why you certainly use ntuple filters in combination with
some kind of high level business logic in user space.

I have to check but am pretty sure you can't even do the simplest thing
in XDP, parsing the apexes of DNS packets and checking them against a
hash table, because the program won't pass the verifier.



Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Florian Westphal
David Miller  wrote:
> Saying that ntuple filters can handle the early drop use case doesn't
> take into consideration the nature of the tables (hundreds of
> thousands of "evil" IP addresses),

Thats not what I said.

But Ok, message received. I rest my case.


Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Thomas Graf
On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
> XDP manipulates packets at free will and thus all security guarantees
> are off as well as in any user space solution.
> 
> Secondly user space provides policy, acl, more controlled memory
> protection, restartability and better debugability. If I had multi
> tenant workloads I would definitely put more complex "business/acl"
> logic into user space, so I can make use of LSM and other features to
> especially prevent a network facing service to attack the tenants. If
> stuff gets put into the kernel you run user controlled code in the
> kernel exposing a much bigger attack vector.
> 
> What use case do you see in XDP specifically e.g. for container networking?

DDOS mitigation to protect distributed applications in large clusters.
Relying on CDN works to protect API gateways and frontends (as long as
they don't throw you out of their network) but offers no protection
beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
level and allowing the mitigation capability to scale up with the number
of servers is natural and cheap.

> > I agree with you if the LB is a software based appliance in either a
> > dedicated VM or on dedicated baremetal.
> > 
> > The reality is turning out to be different in many cases though, LB
> > needs to be performed not only for north south but east west as well.
> > So even if I would handle LB for traffic entering my datacenter in user
> > space, I will need the same LB for packets from my applications and
> > I definitely don't want to move all of that into user space.
> 
> The open question to me is why is programmability needed here.
> 
> Look at the discussion about ECMP and consistent hashing. It is not very
> easy to actually write this code correctly. Why can't we just put C code
> into the kernel that implements this once and for all and let user space
> update the policies?

Whatever LB logic is put in place with native C code now is unlikely the
logic we need in two years. We can't really predict the future. If it
was the case, networking would have been done long ago and we would all
be working on self eating ice cream now.

> Load balancers have to deal correctly with ICMP packets, e.g. they even
> have to be duplicated to every ECMP route. This seems to be problematic
> to do in eBPF programs due to looping constructs so you end up with
> complicated user space anyway.

Feel free to implement such complex LBs in user space or natively. It is
not required for the majority of use cases. The most popular LBs for
application load balancing have no idea of ECMP and require ECMP aware
routers to be made redundant itself.


Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread David Miller
From: Thomas Graf 
Date: Thu, 1 Dec 2016 15:58:34 +0100

> The benefits of XDP for this use case are extremely obvious in combination
> with local applications which need to be protected. ntuple filters won't
> cut it. They are limited and subject to a certain rate at which they
> can be configured. Any serious mitigation will require stateful filtering
> with at least minimal L7 matching abilities and this is exactly where XDP
> will excel.

+1

Saying that ntuple filters can handle the early drop use case doesn't
take into consideration the nature of the tables (hundreds of
thousands of "evil" IP addresses), whether hardware can actually
handle that (it can't), and whether simple IP address matching is the
full extent of it (it isn't).

Most of the time when I hear anti-XDP rhetoric, it's usually comes
from a crowd who for some reason feels threatened by the technology
and what it might replace and make useless.

That to me says that we are _exactly_ going down the right path.


Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Florian Westphal
Thomas Graf  wrote:
> On 12/01/16 at 10:11am, Florian Westphal wrote:
> > Aside from this, XDP, like DPDK, is a kernel bypass.
> > You might say 'Its just stack bypass, not a kernel bypass!'.
> > But what does that mean exactly?  That packets can still be passed
> > onward to normal stack?
> > Bypass solutions like netmap can also inject packets back to
> > kernel stack again.
> 
> I have a fundamental issue with the approach of exporting packets into
> user space and reinjecting them: Once the packet leaves the kernel,
> any security guarantees are off. I have no control over what is
> running in user space and whether whatever listener up there has been
> compromised or not. To me, that's a no go, in particular for servers
> hosting multi tenant workloads. This is one of the main reasons why
> XDP, in particular in combination with BPF, is very interesting to me.

Funny, I see it exactly the other way around :)

To me packet coming from this "userspace injection" is no different than
a tun/tap, or any other packet coming from network.

I see no change or increase in attack surface.



Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Hannes Frederic Sowa
Hi,

On 01.12.2016 15:58, Thomas Graf wrote:
> On 12/01/16 at 10:11am, Florian Westphal wrote:
>> Aside from this, XDP, like DPDK, is a kernel bypass.
>> You might say 'Its just stack bypass, not a kernel bypass!'.
>> But what does that mean exactly?  That packets can still be passed
>> onward to normal stack?
>> Bypass solutions like netmap can also inject packets back to
>> kernel stack again.
> 
> I have a fundamental issue with the approach of exporting packets into
> user space and reinjecting them: Once the packet leaves the kernel,
> any security guarantees are off. I have no control over what is
> running in user space and whether whatever listener up there has been
> compromised or not. To me, that's a no go, in particular for servers
> hosting multi tenant workloads. This is one of the main reasons why
> XDP, in particular in combination with BPF, is very interesting to me.

First of all, this is a rant targeted at XDP and not at eBPF as a whole.
XDP manipulates packets at free will and thus all security guarantees
are off as well as in any user space solution.

Secondly user space provides policy, acl, more controlled memory
protection, restartability and better debugability. If I had multi
tenant workloads I would definitely put more complex "business/acl"
logic into user space, so I can make use of LSM and other features to
especially prevent a network facing service to attack the tenants. If
stuff gets put into the kernel you run user controlled code in the
kernel exposing a much bigger attack vector.

What use case do you see in XDP specifically e.g. for container networking?

>> b). with regards to a programmable data path: IFF one wants to do this
>> in kernel (and thats a big if), it seems much more preferrable to provide
>> a config/data-based approach rather than a programmable one.  If you want
>> full freedom DPDK is architecturally just too powerful to compete with.
> 
> I must have missed the legal disclaimer that is usually put in front
> of the DPDK marketing show :-)
>
> I don't want full freedom. I want programmability with stack integration
> at sufficient speed and the ability to benefit from the hardware
> abstractions that the kernel provides.
> 
>> Proponents of XDP sometimes provide usage examples.
>> Lets look at some of these.
> 
> [ I won't comment on any of the other use cases because they are of no
>   interest to me ]
> 
>> * Load balancer
>> State holding algorithm need sorting and searching, so also no fit for
>> eBPF (could be exposed by function exports, but then can we do DoS by
>> finding worst case scenarios?).
>>
>> Also again needs way to forward frame out via another interface.
>>
>> For cases where packet gets sent out via same interface it would appear
>> to be easier to use port mirroring in a switch and use stochastic filtering
>> on end nodes to determine which host should take responsibility.
>>
>> XDP plus: central authority over how distribution will work in case
>> nodes are added/removed from pool.
>> But then again, it will be easier to hande this with netmap/dpdk where
>> more complicated scheduling algorithms can be used.
> 
> I agree with you if the LB is a software based appliance in either a
> dedicated VM or on dedicated baremetal.
> 
> The reality is turning out to be different in many cases though, LB
> needs to be performed not only for north south but east west as well.
> So even if I would handle LB for traffic entering my datacenter in user
> space, I will need the same LB for packets from my applications and
> I definitely don't want to move all of that into user space.

The open question to me is why is programmability needed here.

Look at the discussion about ECMP and consistent hashing. It is not very
easy to actually write this code correctly. Why can't we just put C code
into the kernel that implements this once and for all and let user space
update the policies?

Load balancers have to deal correctly with ICMP packets, e.g. they even
have to be duplicated to every ECMP route. This seems to be problematic
to do in eBPF programs due to looping constructs so you end up with
complicated user space anyway.

>> * early drop/filtering.
>> While its possible to do "u32" like filters with ebpf, all modern nics
>> support ntuple filtering in hardware, which is going to be faster because
>> such packet will never even be signalled to the operating system.
>> For more complicated cases (e.g. doing socket lookup to check if particular
>> packet does match bound socket (and expected sequence numbers etc) I don't
>> see easy ways to do that with XDP (and without sk_buff context).
>> Providing it via function exports is possible of course, but that will only
>> result in an "arms race" where we will see special-sauce functions
>> all over the place -- DoS will always attempt to go for something
>> that is difficult to filter against, cf. all the recent volume-based
>> floodings.
> 
> You probably put this last because this was the most 

Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Thomas Graf
On 12/01/16 at 10:11am, Florian Westphal wrote:
> Aside from this, XDP, like DPDK, is a kernel bypass.
> You might say 'Its just stack bypass, not a kernel bypass!'.
> But what does that mean exactly?  That packets can still be passed
> onward to normal stack?
> Bypass solutions like netmap can also inject packets back to
> kernel stack again.

I have a fundamental issue with the approach of exporting packets into
user space and reinjecting them: Once the packet leaves the kernel,
any security guarantees are off. I have no control over what is
running in user space and whether whatever listener up there has been
compromised or not. To me, that's a no go, in particular for servers
hosting multi tenant workloads. This is one of the main reasons why
XDP, in particular in combination with BPF, is very interesting to me.

> b). with regards to a programmable data path: IFF one wants to do this
> in kernel (and thats a big if), it seems much more preferrable to provide
> a config/data-based approach rather than a programmable one.  If you want
> full freedom DPDK is architecturally just too powerful to compete with.

I must have missed the legal disclaimer that is usually put in front
of the DPDK marketing show :-)

I don't want full freedom. I want programmability with stack integration
at sufficient speed and the ability to benefit from the hardware
abstractions that the kernel provides.

> Proponents of XDP sometimes provide usage examples.
> Lets look at some of these.

[ I won't comment on any of the other use cases because they are of no
  interest to me ]

> * Load balancer
> State holding algorithm need sorting and searching, so also no fit for
> eBPF (could be exposed by function exports, but then can we do DoS by
> finding worst case scenarios?).
> 
> Also again needs way to forward frame out via another interface.
> 
> For cases where packet gets sent out via same interface it would appear
> to be easier to use port mirroring in a switch and use stochastic filtering
> on end nodes to determine which host should take responsibility.
> 
> XDP plus: central authority over how distribution will work in case
> nodes are added/removed from pool.
> But then again, it will be easier to hande this with netmap/dpdk where
> more complicated scheduling algorithms can be used.

I agree with you if the LB is a software based appliance in either a
dedicated VM or on dedicated baremetal.

The reality is turning out to be different in many cases though, LB
needs to be performed not only for north south but east west as well.
So even if I would handle LB for traffic entering my datacenter in user
space, I will need the same LB for packets from my applications and
I definitely don't want to move all of that into user space.

> * early drop/filtering.
> While its possible to do "u32" like filters with ebpf, all modern nics
> support ntuple filtering in hardware, which is going to be faster because
> such packet will never even be signalled to the operating system.
> For more complicated cases (e.g. doing socket lookup to check if particular
> packet does match bound socket (and expected sequence numbers etc) I don't
> see easy ways to do that with XDP (and without sk_buff context).
> Providing it via function exports is possible of course, but that will only
> result in an "arms race" where we will see special-sauce functions
> all over the place -- DoS will always attempt to go for something
> that is difficult to filter against, cf. all the recent volume-based
> floodings.

You probably put this last because this was the most difficult to
shoot down ;-)

The benefits of XDP for this use case are extremely obvious in combination
with local applications which need to be protected. ntuple filters won't
cut it. They are limited and subject to a certain rate at which they
can be configured. Any serious mitigation will require stateful filtering
with at least minimal L7 matching abilities and this is exactly where XDP
will excel.


Re: [flamebait] xdp, well meaning but pointless

2016-12-01 Thread Hannes Frederic Sowa
On 01.12.2016 10:11, Florian Westphal wrote:
> [ As already mentioned in my reply to Tom, here is
> the xdp flamebait/critique ]
> 
> Lots of XDP related patches started to appear on netdev.
> I'd prefer if it would stop...

I discussed this with Florian and helped with the text. I want to
mention this to express my full support for this.

Thanks,
Hannes



[flamebait] xdp, well meaning but pointless

2016-12-01 Thread Florian Westphal
[ As already mentioned in my reply to Tom, here is
the xdp flamebait/critique ]

Lots of XDP related patches started to appear on netdev.
I'd prefer if it would stop...

To me XDP combines all disadvantages of stack bypass solutions like dpdk
with the disadvantages of kernel programming with a more limited
instruction set and toolchain.

Unlike XDP userspace bypass (dpdk et al) allow use of any programming
model or language you want (including scripting languages), which
makes things a lot easier, e.g. garbage collection, debuggers vs.
crash+vmcore+printk...

I have heared the argument that these restrictions that come with
XDP are great because it allows to 'limit what users can do'.

Given existence of DPDK/netmap/userspace bypass is a reality, this is
a very weak argument -- why would anyone pick XDP over a dpdk/netmap
based solution?
XDP will always be less powerful and a lot more complicated,
especially considering users of dpdk (or toolkits built on top of it)
are not kernel programmers and userspace has more powerful ipc
(or storage) mechanisms.

Aside from this, XDP, like DPDK, is a kernel bypass.
You might say 'Its just stack bypass, not a kernel bypass!'.
But what does that mean exactly?  That packets can still be passed
onward to normal stack?
Bypass solutions like netmap can also inject packets back to
kernel stack again.

Running less powerful user code in a restricted environment in the kernel
address space is certainly a worse idea than separating this logic out
to user space.

In light of DPDKs existence it make a lot more sense to me to provide
a). a faster mmap based interface (possibly AF_PACKET based) that allows
to map nic directly into userspace, detaching tx/rx queue from kernel.

John Fastabend sent something like this last year as a proof of
concept, iirc it was rejected because register space got exposed directly
to userspace.  I think we should re-consider merging netmap
(or something conceptually close to its design).

b). with regards to a programmable data path: IFF one wants to do this
in kernel (and thats a big if), it seems much more preferrable to provide
a config/data-based approach rather than a programmable one.  If you want
full freedom DPDK is architecturally just too powerful to compete with.

Proponents of XDP sometimes provide usage examples.
Lets look at some of these.

== Application developement: ==
* DNS Server
data structures and algorithms need to be implemented in a mostly touring
complete language, so eBPF cannot readily be be used for that.
At least it will be orders of magnitude harder than in userspace.

* TCP Endpoint
TCP processing in eBPF is a bit out of question while userspace tcp stacks
based on both netmap and dpdk already exist today.

== Forwarding dataplane: ==

* Router/Switch
Router and switches should actually adhere to standardized and specified
protocols and thus don't need a lot of custom software and specialized
software.  Still a lot more work compared to userspace offloads where
you can do things like allocating a 4GB array to perform nexthop lookup.
Also needs ability to perform tx on another interface.

* Load balancer
State holding algorithm need sorting and searching, so also no fit for
eBPF (could be exposed by function exports, but then can we do DoS by
finding worst case scenarios?).

Also again needs way to forward frame out via another interface.

For cases where packet gets sent out via same interface it would appear
to be easier to use port mirroring in a switch and use stochastic filtering
on end nodes to determine which host should take responsibility.

XDP plus: central authority over how distribution will work in case
nodes are added/removed from pool.
But then again, it will be easier to hande this with netmap/dpdk where
more complicated scheduling algorithms can be used.

* early drop/filtering.
While its possible to do "u32" like filters with ebpf, all modern nics
support ntuple filtering in hardware, which is going to be faster because
such packet will never even be signalled to the operating system.
For more complicated cases (e.g. doing socket lookup to check if particular
packet does match bound socket (and expected sequence numbers etc) I don't
see easy ways to do that with XDP (and without sk_buff context).
Providing it via function exports is possible of course, but that will only
result in an "arms race" where we will see special-sauce functions
all over the place -- DoS will always attempt to go for something
that is difficult to filter against, cf. all the recent volume-based
floodings.

Thanks, Florian