Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-04 Thread Alexei Starovoitov
On Mon, Apr 04, 2016 at 09:48:46AM +0200, Jesper Dangaard Brouer wrote:
> On Sat, 2 Apr 2016 22:41:04 -0700
> Brenden Blanco  wrote:
> 
> > On Sat, Apr 02, 2016 at 12:47:16PM -0400, Tom Herbert wrote:
> >
> > > Very nice! Do you think this hook will be sufficient to implement a
> > > fast forward patch also?
> 
> (DMA experts please verify and correct me!)
> 
> One of the gotchas is how DMA sync/unmap works.  For forwarding you
> need to modify the headers.  The DMA sync API (DMA_FROM_DEVICE) specifies
> that the data is to be _considered_ read-only.  AFAIK you can write into
> the data, BUT on DMA_unmap the API/DMA-engine is allowed to overwrite
> the data... note that on most archs the DMA_unmap does not overwrite.
> 
> This DMA issue should not block the work on a hook for early packet drop.
> Maybe we should add a flag option that can tell the hook whether the
> packet is read-only (e.g. if the driver uses page-fragments and DMA_sync).
> 
> 
> We should have another track/thread on how to solve the DMA issue:
> I see two solutions.
> 
> Solution 1: Simply use a "full" page per packet and do the DMA_unmap.
> This results in a slowdown on archs with expensive DMA-map/unmap, and
> we stress the page allocator more (which can be solved with a page-pool-cache).
> Eric will not like this due to memory usage, but we can just add a
> "copy-break" step for normal stack hand-off.
> 
> Solution 2: (Due credit to Alex Duyck, this idea came up while
> discussing the issue with him.)  Remember, DMA_sync'ed data is only
> considered read-only because the DMA_unmap can be destructive.  In many
> cases DMA_unmap is not destructive.  Thus, we could take advantage of this
> and allow modifying DMA_sync'ed data on those DMA setups.

I bet on those devices dma_sync is a noop as well.
In ndo_bpf_set we can check

  if (sync_single_for_cpu != swiotlb_sync_single_for_cpu)
          return -ENOTSUPP;

to avoid all these problems altogether. We're doing this to get the
highest possible performance, so we have to sacrifice some generality.
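
A minimal sketch of what such a gate could look like, assuming the
ndo_bpf_set() callback from patch 2 and the current dma_map_ops layout;
the function and variable names below are illustrative, not taken from
the patch set:

#include <linux/dma-mapping.h>
#include <linux/netdevice.h>
#include <linux/swiotlb.h>

/* Illustrative only: refuse to attach the program unless the device
 * sits on the common swiotlb dma_ops, where dma_sync is effectively
 * a no-op for non-bounced buffers (mirrors the check suggested above).
 */
static int example_ndo_bpf_set(struct net_device *netdev, int fd)
{
        struct dma_map_ops *ops = get_dma_ops(netdev->dev.parent);

        if (!ops || ops->sync_single_for_cpu != swiotlb_sync_single_for_cpu)
                return -ENOTSUPP;

        /* ... install the bpf program referenced by fd ... */
        return 0;
}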

This BPF_PROG_TYPE_PHYS_DEV program type is only applicable to physical
ethernet networking devices, as the name clearly indicates.
Devices like taps or veth will not have such an ndo.
These are early architectural decisions that we have to make to
actually hit our performance targets.
This is not 'yet another hook in the stack'. We already have tc+cls_bpf,
which is pretty fast, but it's generic, works with veth, taps, and
physical devices, and by design operates on skbs.
BPF_PROG_TYPE_PHYS_DEV operates on the dma buffer. Virtual devices
don't have dma buffers, so there is no ndo for them.
Probably the confusion is due to the 'pseudo skb' name in the patches.
I guess we have to pick some other name.



Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-04 Thread Jesper Dangaard Brouer
On Sat, 2 Apr 2016 22:41:04 -0700
Brenden Blanco  wrote:

> On Sat, Apr 02, 2016 at 12:47:16PM -0400, Tom Herbert wrote:
>
> > Very nice! Do you think this hook will be sufficient to implement a
> > fast forward patch also?

(DMA experts please verify and correct me!)

One of the gotchas is how DMA sync/unmap works.  For forwarding you
need to modify the headers.  The DMA sync API (DMA_FROM_DEVICE) specifies
that the data is to be _considered_ read-only.  AFAIK you can write into
the data, BUT on DMA_unmap the API/DMA-engine is allowed to overwrite
the data... note that on most archs the DMA_unmap does not overwrite.

This DMA issue should not block the work on a hook for early packet drop.
Maybe we should add a flag option that can tell the hook whether the
packet is read-only (e.g. if the driver uses page-fragments and DMA_sync).


We should have another track/thread on how to solve the DMA issue:
I see two solutions.

Solution 1: Simply use a "full" page per packet and do the DMA_unmap.
This results in a slowdown on archs with expensive DMA-map/unmap, and
we stress the page allocator more (which can be solved with a page-pool-cache).
Eric will not like this due to memory usage, but we can just add a
"copy-break" step for normal stack hand-off.

Solution 2: (Due credit to Alex Duyck, this idea came up while
discussing the issue with him.)  Remember, DMA_sync'ed data is only
considered read-only because the DMA_unmap can be destructive.  In many
cases DMA_unmap is not destructive.  Thus, we could take advantage of this
and allow modifying DMA_sync'ed data on those DMA setups.


> That is the goal, but more work needs to be done of course. It won't be
> possible with just a single pseudo skb; the driver will need a fast
> way to get batches of pseudo skbs (per core?) through from rx to tx.
> In mlx4 for instance, either the skb needs to be much more complete
> to be handled from the start of mlx4_en_xmit(), or that function
> would need to be split so that the fast tx could start midway through.
> 
> Or, skb allocation just gets much faster. Then it should be pretty
> straightforward.

With the bulking SLUB API, we can reduce the bare kmem_cache_alloc+free
cost per SKB from 90 cycles to 27 cycles.  That is good, but for really
fast forwarding it would be better to avoid allocating any extra data
structures at all.  We just want to move an RX packet-page to a TX ring
queue.

Maybe the 27-cycle kmem_cache/slab cost is considered "fast enough" for
what we gain in ease of implementation.  The really expensive part of
SKB processing is the memset/clearing of the SKB, which the fast-forward
use-case could avoid.  Splitting the SKB allocation from the clearing
would be a needed first step.
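
For reference, a minimal sketch of the bulk API in question,
kmem_cache_alloc_bulk() and kmem_cache_free_bulk(); the batch size and
function names here are only illustrative:

#include <linux/slab.h>

#define BATCH 16

/* Illustrative only: one bulk call amortizes the per-object
 * alloc/free cost across the whole batch.
 */
static int grab_batch(struct kmem_cache *cache, void **objs)
{
        if (!kmem_cache_alloc_bulk(cache, GFP_ATOMIC, BATCH, objs))
                return -ENOMEM; /* nothing was allocated */
        return 0;
}

static void put_batch(struct kmem_cache *cache, void **objs)
{
        kmem_cache_free_bulk(cache, BATCH, objs);
}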

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-04 Thread Johannes Berg
On Sun, 2016-04-03 at 11:28 +0900, Lorenzo Colitti wrote:

> That said, getting BPF to the driver is part of the picture. On the
> chipsets we're targeting for APF, we're only seeing 2k-4k of memory
> (that's 256-512 BPF instructions) available for filtering code, which
> means that BPF might be too large.

That's true, but I don't think that should really be an issue as far as
the userspace API is concerned. I think we can compile the BPF into
APF, similar to how BPF can be compiled into machine code today.
Additionally, I'm not sure we can realistically expect all devices to
implement APF "natively". I think there's a good chance, but there's
also the possibility of compiling to the native firmware environment,
for example.

johannes


Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-02 Thread Brenden Blanco
On Sat, Apr 02, 2016 at 12:47:16PM -0400, Tom Herbert wrote:
> Very nice! Do you think this hook will be sufficient to implement a
> fast forward patch also?
That is the goal, but more work needs to be done of course. It won't be
possible with just a single pseudo skb; the driver will need a fast way to get
batches of pseudo skbs (per core?) through from rx to tx. In mlx4 for
instance, either the skb needs to be much more complete to be handled from the
start of mlx4_en_xmit(), or that function would need to be split so that the
fast tx could start midway through.

Or, skb allocation just gets much faster. Then it should be pretty
straightforward.
> 
> Tom


Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-02 Thread Lorenzo Colitti
On Sun, Apr 3, 2016 at 7:57 AM, Tom Herbert  wrote:
> I am curious though, how do you think this would specifically help
> Android with power? Seems like the receiver still needs to be powered
> to receive packets to filter them anyway...

The receiver is powered up, but its wake/sleep cycles are much shorter
than the main CPU's. On a phone, leaving the CPU asleep with wifi on
might consume ~5mA average, but getting the CPU out of suspend might
average ~200mA for ~300ms as the system comes out of sleep,
initializes other hardware, wakes up userspace processes whose
timeouts have fired, freezes, and suspends again. Receiving one such
superfluous packet every 3 seconds (e.g., on networks that send
identical IPv6 RAs once every 3 seconds) works out to ~25mA, which is
5x the cost of idle. Pushing down filters to the hardware so it can
drop the packet without waking up the CPU thus saves a lot of idle
power.
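
A quick back-of-the-envelope check of those numbers (the constants
below are just the rough figures from this paragraph):

#include <stdio.h>

int main(void)
{
        double idle_ma  = 5.0;   /* CPU asleep, wifi on               */
        double wake_ma  = 200.0; /* average draw while out of suspend */
        double wake_s   = 0.3;   /* time spent awake per wakeup       */
        double period_s = 3.0;   /* one superfluous packet every 3 s  */

        double avg_ma = idle_ma + wake_ma * wake_s / period_s;

        printf("average draw: ~%.0f mA (~%.0fx idle)\n",
               avg_ma, avg_ma / idle_ma);
        return 0;
}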

That said, getting BPF to the driver is part of the picture. On the
chipsets we're targeting for APF, we're only seeing 2k-4k of memory
(that's 256-512 BPF instructions) available for filtering code, which
means that BPF might be too large.


Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-02 Thread Tom Herbert
On Sat, Apr 2, 2016 at 2:41 PM, Johannes Berg  wrote:
> On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
>> This patch set introduces new infrastructure for programmatically
>> processing packets in the earliest stages of rx, as part of an effort
>> others are calling Express Data Path (XDP) [1]. Start this effort by
>> introducing a new bpf program type for early packet filtering, before
>> even an skb has been allocated.
>>
>> With this, we hope to enable line rate filtering, with this initial
>> implementation providing drop/allow actions only.
>
> Since this is handed to the driver in some way, I assume the API would
> also allow offloading the program to the NIC itself, and as such be
> useful for what Android wants to do to save power in wireless?
>
Conceptually, yes. There is some ongoing work to offload BPF and one
goal is that BPF programs (like for XDP) could be portable between
userspace, kernel (maybe even other OSes), and devices.

I am curious though, how do you think this would specifically help
Android with power? Seems like the receiver still needs to be powered
to receive packets to filter them anyway...

Thanks,
Tom

> johannes


Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-02 Thread Johannes Berg
On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> This patch set introduces new infrastructure for programmatically
> processing packets in the earliest stages of rx, as part of an effort
> others are calling Express Data Path (XDP) [1]. Start this effort by
> introducing a new bpf program type for early packet filtering, before
> even an skb has been allocated.
> 
> With this, we hope to enable line rate filtering, with this initial
> implementation providing drop/allow actions only.

Since this is handed to the driver in some way, I assume the API would
also allow offloading the program to the NIC itself, and as such be
useful for what Android wants to do to save power in wireless?

johannes


Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-02 Thread Tom Herbert
On Fri, Apr 1, 2016 at 9:21 PM, Brenden Blanco  wrote:
> This patch set introduces new infrastructure for programmatically
> processing packets in the earliest stages of rx, as part of an effort
> others are calling Express Data Path (XDP) [1]. Start this effort by
> introducing a new bpf program type for early packet filtering, before even
> an skb has been allocated.
>
> With this, we hope to enable line rate filtering, with this initial
> implementation providing drop/allow actions only.
>
> Patch 1 introduces the new prog type and helpers for validating the bpf
> program. A new userspace struct is defined containing only len as a field,
> with others to follow in the future.
> In patch 2, create a new ndo to pass the fd to support drivers.
> In patch 3, expose a new rtnl option to userspace.
> In patch 4, enable support in mlx4 driver. No skb allocation is required,
> instead a static percpu skb is kept in the driver and minimally initialized
> for each driver frag.
> In patch 5, create a sample drop and count program. With single core,
> achieved ~14.5 Mpps drop rate on a 40G mlx4. This includes packet data
> access, bpf array lookup, and increment.
>
Very nice! Do you think this hook will be sufficient to implement a
fast forward patch also?

Tom

> Interestingly, accessing packet data from the program did not have a
> noticeable impact on performance. Even so, future enhancements to
> prefetching / batching / page-allocs should hopefully improve the
> performance in this path.
>
> [1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
>
> Brenden Blanco (5):
>   bpf: add PHYS_DEV prog type for early driver filter
>   net: add ndo to set bpf prog in adapter rx
>   rtnl: add option for setting link bpf prog
>   mlx4: add support for fast rx drop bpf program
>   Add sample for adding simple drop program to link
>
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  61 ++
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |  18 +++
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   2 +
>  include/linux/netdevice.h  |   8 ++
>  include/uapi/linux/bpf.h   |   5 +
>  include/uapi/linux/if_link.h   |   1 +
>  kernel/bpf/verifier.c  |   1 +
>  net/core/dev.c |  12 ++
>  net/core/filter.c  |  68 +++
>  net/core/rtnetlink.c   |  10 ++
>  samples/bpf/Makefile   |   4 +
>  samples/bpf/bpf_load.c |   8 ++
>  samples/bpf/netdrvx1_kern.c|  26 +
>  samples/bpf/netdrvx1_user.c| 155 +
>  14 files changed, 379 insertions(+)
>  create mode 100644 samples/bpf/netdrvx1_kern.c
>  create mode 100644 samples/bpf/netdrvx1_user.c
>
> --
> 2.8.0
>


[RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-01 Thread Brenden Blanco
This patch set introduces new infrastructure for programmatically
processing packets in the earliest stages of rx, as part of an effort
others are calling Express Data Path (XDP) [1]. Start this effort by
introducing a new bpf program type for early packet filtering, before even
an skb has been allocated.

With this, we hope to enable line rate filtering, with this initial
implementation providing drop/allow actions only.

Patch 1 introduces the new prog type and helpers for validating the bpf
program. A new userspace struct is defined containing only len as a field,
with others to follow in the future.
In patch 2, create a new ndo to pass the fd to support drivers. 
In patch 3, expose a new rtnl option to userspace.
In patch 4, enable support in mlx4 driver. No skb allocation is required,
instead a static percpu skb is kept in the driver and minimally initialized
for each driver frag.
In patch 5, create a sample drop and count program. With single core,
achieved ~14.5 Mpps drop rate on a 40G mlx4. This includes packet data
access, bpf array lookup, and increment.
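
To make the shape of such a program concrete, here is a minimal sketch
of a drop-and-count program in the spirit of samples/bpf/netdrvx1_kern.c;
the context struct, section name, and return convention are assumptions
based on this cover letter, not code taken from the patches:

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") drop_cnt = {
        .type        = BPF_MAP_TYPE_ARRAY,
        .key_size    = sizeof(u32),
        .value_size  = sizeof(long),
        .max_entries = 1,
};

/* Placeholder context: the cover letter only says the userspace
 * struct carries a len field; the real definition lives in patch 1.
 */
struct phys_dev_md {
        __u32 len;
};

SEC("phys_dev")
int drop_and_count(struct phys_dev_md *ctx)
{
        u32 key = 0;
        long *cnt = bpf_map_lookup_elem(&drop_cnt, &key);

        if (cnt)
                (*cnt)++;       /* per-packet array lookup + increment */

        return 0;               /* assumed here to mean "drop" */
}

char _license[] SEC("license") = "GPL";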

Interestingly, accessing packet data from the program did not have a
noticeable impact on performance. Even so, future enhancements to
prefetching / batching / page-allocs should hopefully improve the
performance in this path.

[1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf

Brenden Blanco (5):
  bpf: add PHYS_DEV prog type for early driver filter
  net: add ndo to set bpf prog in adapter rx
  rtnl: add option for setting link bpf prog
  mlx4: add support for fast rx drop bpf program
  Add sample for adding simple drop program to link

 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  61 ++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |  18 +++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   2 +
 include/linux/netdevice.h  |   8 ++
 include/uapi/linux/bpf.h   |   5 +
 include/uapi/linux/if_link.h   |   1 +
 kernel/bpf/verifier.c  |   1 +
 net/core/dev.c |  12 ++
 net/core/filter.c  |  68 +++
 net/core/rtnetlink.c   |  10 ++
 samples/bpf/Makefile   |   4 +
 samples/bpf/bpf_load.c |   8 ++
 samples/bpf/netdrvx1_kern.c|  26 +
 samples/bpf/netdrvx1_user.c| 155 +
 14 files changed, 379 insertions(+)
 create mode 100644 samples/bpf/netdrvx1_kern.c
 create mode 100644 samples/bpf/netdrvx1_user.c

-- 
2.8.0