Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-16 Thread Jesper Dangaard Brouer

On Thu, 16 Nov 2017 08:09:04 +0100 Björn Töpel  wrote:

> Ideally, it would be best not to have to introduce yet another xmit
> ndo. I believe ndo_xdp_xmit/ndo_xdp_flush would be the best fit, but
> we need to extend it with a destructor callback and potentially some
> kind of DMA trait. Why DMA? For zerocopy, we know the working set of
> packet buffers, so they are DMA mapped up front, whereas ndo_xdp_xmit
> does yet another DMA mapping. Paying for the DMA mapping in the
> fast-path is something we'd like to avoid.

I like your idea of reusing ndo_xdp_xmit/ndo_xdp_flush.  At NetConf I
think we agreed that the ndo_xdp_xmit API likely needs to change; see [1],
slide 11.  Andy Gospodarek and Michael Chan wanted to look into the
needed API changes (Cc'ed), so let's keep them in the loop.

I also appreciate that you are thinking about avoiding the DMA mapping
at TX.  It would be a welcome optimization.

[1] 
http://people.netfilter.org/hawk/presentations/NetConf2017_Seoul/XDP_devel_update_NetConf2017_Seoul.pdf
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support (AF_XDP or AF_CHANNEL?)

2017-11-16 Thread Jesper Dangaard Brouer

On Tue, 14 Nov 2017 20:01:01 +0100 Björn Töpel  wrote:

> 2017-11-14 18:19 GMT+01:00 Jesper Dangaard Brouer :
> >
> > On Mon, 13 Nov 2017 22:07:47 +0900 Björn Töpel  
> > wrote:
> >  
> >> I'll summarize the major points that we'll address in the next RFC
> >> below.
> >>
> >> * Instead of extending AF_PACKET with yet another version, introduce a
> >>   new address/packet family. As for naming, we had some suggestions:
> >>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
> >>   AF_ZEROCOPY, unless there are strong opinions against it.
> >
> > I mostly like AF_CHANNEL and AF_XDP. I do know XDP is/has-evolved-into
> > a kernel-side facility that moves XDP frames/packets _inside_ the
> > kernel.
> >
> > *BUT* I've always imagined that we would create a "channel" to
> > userspace, by using XDP_REDIRECT to choose which frames get redirected
> > into which userspace "channel" (a new channel-map type).  Userspace
> > pre-allocates and registers memory/pages exactly like this patchset.
> >
> > [Step 1]: (non-ZC) XDP_REDIRECT needs to copy frame data into userspace
> > memory pages, and update your packet_array etc.  (Use map-flush to get
> > RX bulking.)
> >
> > [Step 2]: (ZC) Userspace calls a driver NDO to register pages.  The
> > XDP_REDIRECT action happens in the driver, which can have knowledge
> > about the RX-ring.  It can know if this RX-ring is zero-copy enabled
> > and can skip the copy step.
> >  
> 
> Jesper, I *really* like this approach -- especially the fact that the
> existing XDP path in the drivers can be reused. I'll spend some time
> dissecting the details of your suggestion.

I'm very happy that you like this approach :-)

> >> * No explicit zerocopy enablement. Use the zerocopy path if
> >>   supported; if not, fall back to the skb path, for netdevs that
> >>   don't support the required ndos.
> >
> > When the driver does not support the NDO in the above model, I think
> > there will still be a significant performance boost for the non-ZC
> > variant.  Even though we need a copy operation, there are no memory
> > allocations, as userspace has preallocated and registered pages with
> > the kernel (and mem-limits are implicit via the mem size registered
> > by userspace).
> >  
> 
> Yup, and we're not paying for the whole skb creation, given that we
> execute from XDP_DRV and not XDP_SKB.

Yes, exactly. Avoiding the SKB allocation for non-ZC mode will be a
significant saving.  As your benchmarks showed, the AF_PACKET-V4
approach for non-ZC mode does not give you/us any real performance
improvement.  This approach would.


> >> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
> >>   XDP redirect map call with ingress flag.  
> >
> > In the above model, XDP_REDIRECT is used for filtering into a userspace
> > "channel".  If ZC gets enabled on an RX-ring queue, then XDP_PASS has
> > to do a copy (RX-ring knowledge is available), like you describe with
> > XDP_PASS_TO_KERNEL.
> >  
> 
> Again, this fits nicely in.
> 
> >> * Extend the XDP redirect to support explicit allocator/destructor
> >>   functions. Right now, XDP redirect assumes that the page allocator
> >>   was used, and the XDP redirect cleanup path is decreasing the page
> >>   count of the XDP buffer. This assumption breaks for the zerocopy
> >>   case.  
> >
> > Yes, please.  If XDP_REDIRECT gets a destructor callback, then we
> > can allow XDP_REDIRECT out another net_device, even when ZC is enabled
> > on an RX-ring queue.

I will (of course) be eager to test and benchmark this approach, as I
have high hopes for a performance boost even for non-ZC.  I know an AF_XDP
approach is a lot of work, but I would like to offer to help out in
any way I can.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-15 Thread Björn Töpel
2017-11-16 4:35 GMT+01:00 Willem de Bruijn :
> On Wed, Nov 15, 2017 at 9:55 PM, Alexei Starovoitov  wrote:
>> On 11/14/17 4:20 AM, Willem de Bruijn wrote:
>>>
>>>
>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>   in a separate patchset.
>>
>>
>>
>> all sounds good to me except above bit.
>> I don't remember people suggesting to split it this way.
>> What's the value of it without tx?
>>
>
> We definitely need Tx for our use-cases! I'll rephrase: the
> idea was to make the initial patch set without Tx *driver*-specific
> code, e.g. adopting ndo_xdp_xmit/flush at a later point.
>
> So AF_ZEROCOPY, the socket parts, would have Tx support.
>
> @John Did I recall that correctly?
>

 Yep, that is what I said. However, on second thought, without the
 driver tx half I guess tx will be significantly slower.
>>>
>>>
>>> The idea was that existing packet rings already send without
>>> copying, so the benefit from device driver changes is not obvious.
>>>
>>> I would leave them out for now and evaluate before possibly
>>> sending a separate patchset.
>>
>>
>> are you suggesting to use new af_zerocopy for rx and old
>> af_packet for tx ? imo that's too cumbersome to use.
>> New interface has to be symmetrical from the start.
>
> No, that tx can be implemented without device driver
> changes. At least initially.
>
> Unlike rx, tx does not need driver support to implement
> copy avoidance, as pf_packet tx_ring already has this.
>
> Having to go through ndo_start_xmit does introduce other
> overhead, notably skb alloc. Perhaps ndo_xdp_xmit is a
> better choice (but I'm not very familiar with that).
>
> If some cost is inherent to a device-independent solution
> and needs driver support to avoid it, then that can be added
> in a follow-on patchset. But this one is large already without
> the i40e tx patch.

Ideally, it would be best not to have to introduce yet another xmit
ndo. I believe ndo_xdp_xmit/ndo_xdp_flush would be the best fit, but
we need to extend it with a destructor callback and potentially some
kind of DMA trait. Why DMA? For zerocopy, we know the working set of
packet buffers, so they are DMA mapped up front, whereas ndo_xdp_xmit
does yet another DMA mapping. Paying for the DMA mapping in the
fast-path is something we'd like to avoid.


Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-15 Thread Willem de Bruijn
On Wed, Nov 15, 2017 at 9:55 PM, Alexei Starovoitov  wrote:
> On 11/14/17 4:20 AM, Willem de Bruijn wrote:
>>
>>
>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>   in a separate patchset.
>
>
>
> all sounds good to me except above bit.
> I don't remember people suggesting to split it this way.
> What's the value of it without tx?
>

 We definitely need Tx for our use-cases! I'll rephrase: the
 idea was to make the initial patch set without Tx *driver*-specific
 code, e.g. adopting ndo_xdp_xmit/flush at a later point.

 So AF_ZEROCOPY, the socket parts, would have Tx support.

 @John Did I recall that correctly?

>>>
>>> Yep, that is what I said. However, on second thought, without the
>>> driver tx half I guess tx will be significantly slower.
>>
>>
>> The idea was that existing packet rings already send without
>> copying, so the benefit from device driver changes is not obvious.
>>
>> I would leave them out for now and evaluate before possibly
>> sending a separate patchset.
>
>
> are you suggesting to use new af_zerocopy for rx and old
> af_packet for tx ? imo that's too cumbersome to use.
> New interface has to be symmetrical from the start.

No, that tx can be implemented without device driver
changes. At least initially.

Unlike rx, tx does not need driver support to implement
copy avoidance, as pf_packet tx_ring already has this.

Having to go through ndo_start_xmit does introduce other
overhead, notably skb alloc. Perhaps ndo_xdp_xmit is a
better choice (but I'm not very familiar with that).

If some cost is inherent to a device-independent solution
and needs driver support to avoid it, then that can be added
in a follow-on patchset. But this one is large already without
the i40e tx patch.


Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-15 Thread Alexei Starovoitov

On 11/14/17 4:20 AM, Willem de Bruijn wrote:


* Limit the scope of the first patchset to Rx only, and introduce Tx
  in a separate patchset.



all sounds good to me except above bit.
I don't remember people suggesting to split it this way.
What's the value of it without tx?



We definitely need Tx for our use-cases! I'll rephrase: the
idea was to make the initial patch set without Tx *driver*-specific
code, e.g. adopting ndo_xdp_xmit/flush at a later point.

So AF_ZEROCOPY, the socket parts, would have Tx support.

@John Did I recall that correctly?



Yep, that is what I said. However, on second thought, without the
driver tx half I guess tx will be significantly slower.


The idea was that existing packet rings already send without
copying, so the benefit from device driver changes is not obvious.

I would leave them out for now and evaluate before possibly
sending a separate patchset.


are you suggesting to use new af_zerocopy for rx and old
af_packet for tx ? imo that's too cumbersome to use.
New interface has to be symmetrical from the start.



Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support (AF_XDP or AF_CHANNEL?)

2017-11-14 Thread Björn Töpel
2017-11-14 18:19 GMT+01:00 Jesper Dangaard Brouer :
>
> On Mon, 13 Nov 2017 22:07:47 +0900 Björn Töpel  wrote:
>
>> I'll summarize the major points that we'll address in the next RFC
>> below.
>>
>> * Instead of extending AF_PACKET with yet another version, introduce a
>>   new address/packet family. As for naming, we had some suggestions:
>>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>>   AF_ZEROCOPY, unless there are strong opinions against it.
>
> I mostly like AF_CHANNEL and AF_XDP. I do know XDP is/has-evolved-into
> a kernel-side facility that moves XDP frames/packets _inside_ the
> kernel.
>
> *BUT* I've always imagined that we would create a "channel" to
> userspace, by using XDP_REDIRECT to choose which frames get redirected
> into which userspace "channel" (a new channel-map type).  Userspace
> pre-allocates and registers memory/pages exactly like this patchset.
>
> [Step 1]: (non-ZC) XDP_REDIRECT needs to copy frame data into userspace
> memory pages, and update your packet_array etc.  (Use map-flush to get
> RX bulking.)
>
> [Step 2]: (ZC) Userspace calls a driver NDO to register pages.  The
> XDP_REDIRECT action happens in the driver, which can have knowledge
> about the RX-ring.  It can know if this RX-ring is zero-copy enabled
> and can skip the copy step.
>

Jesper, I *really* like this approach -- especially the fact that the
existing XDP path in the drivers can be reused. I'll spend some time
dissecting the details of your suggestion.

>> * No explicit zerocopy enablement. Use the zerocopy path if
>>   supported; if not, fall back to the skb path, for netdevs that
>>   don't support the required ndos.
>
> When the driver does not support the NDO in the above model, I think
> there will still be a significant performance boost for the non-ZC
> variant.  Even though we need a copy operation, there are no memory
> allocations, as userspace has preallocated and registered pages with
> the kernel (and mem-limits are implicit via the mem size registered
> by userspace).
>

Yup, and we're not paying for the whole skb creation, given that we
execute from XDP_DRV and not XDP_SKB.

>> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
>>   XDP redirect map call with ingress flag.
>
> In the above model, XDP_REDIRECT is used for filtering into a userspace
> "channel".  If ZC gets enabled on an RX-ring queue, then XDP_PASS has
> to do a copy (RX-ring knowledge is available), like you describe with
> XDP_PASS_TO_KERNEL.
>

Again, this fits nicely in.

>> * Extend the XDP redirect to support explicit allocator/destructor
>>   functions. Right now, XDP redirect assumes that the page allocator
>>   was used, and the XDP redirect cleanup path is decreasing the page
>>   count of the XDP buffer. This assumption breaks for the zerocopy
>>   case.
>
> Yes, please.  If XDP_REDIRECT gets a destructor callback, then we
> can allow XDP_REDIRECT out another net_device, even when ZC is enabled
> on an RX-ring queue.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support (AF_XDP or AF_CHANNEL?)

2017-11-14 Thread Jesper Dangaard Brouer

On Mon, 13 Nov 2017 22:07:47 +0900 Björn Töpel  wrote:

> I'll summarize the major points that we'll address in the next RFC
> below.
> 
> * Instead of extending AF_PACKET with yet another version, introduce a
>   new address/packet family. As for naming, we had some suggestions:
>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>   AF_ZEROCOPY, unless there are strong opinions against it.

I mostly like AF_CHANNEL and AF_XDP. I do know XDP is/has-evolved-into
a kernel-side facility that moves XDP frames/packets _inside_ the
kernel.

*BUT* I've always imagined that we would create a "channel" to
userspace, by using XDP_REDIRECT to choose which frames get redirected
into which userspace "channel" (a new channel-map type).  Userspace
pre-allocates and registers memory/pages exactly like this patchset.

[Step 1]: (non-ZC) XDP_REDIRECT needs to copy frame data into userspace
memory pages, and update your packet_array etc.  (Use map-flush to get
RX bulking.)

[Step 2]: (ZC) Userspace calls a driver NDO to register pages.  The
XDP_REDIRECT action happens in the driver, which can have knowledge
about the RX-ring.  It can know if this RX-ring is zero-copy enabled
and can skip the copy step.


> * No explicit zerocopy enablement. Use the zerocopy path if
>   supported; if not, fall back to the skb path, for netdevs that
>   don't support the required ndos.

When the driver does not support the NDO in the above model, I think
there will still be a significant performance boost for the non-ZC
variant.  Even though we need a copy operation, there are no memory
allocations, as userspace has preallocated and registered pages with
the kernel (and mem-limits are implicit via the mem size registered
by userspace).


> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
>   XDP redirect map call with ingress flag.

In the above model, XDP_REDIRECT is used for filtering into a userspace
"channel".  If ZC gets enabled on an RX-ring queue, then XDP_PASS has
to do a copy (RX-ring knowledge is available), like you describe with
XDP_PASS_TO_KERNEL.


> * Extend the XDP redirect to support explicit allocator/destructor
>   functions. Right now, XDP redirect assumes that the page allocator
>   was used, and the XDP redirect cleanup path is decreasing the page
>   count of the XDP buffer. This assumption breaks for the zerocopy
>   case.

Yes, please.  If XDP_REDIRECT gets a destructor callback, then we
can allow XDP_REDIRECT out another net_device, even when ZC is enabled
on an RX-ring queue.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-14 Thread Willem de Bruijn

 * Limit the scope of the first patchset to Rx only, and introduce Tx
   in a separate patchset.
>>>
>>>
>>> all sounds good to me except above bit.
>>> I don't remember people suggesting to split it this way.
>>> What's the value of it without tx?
>>>
>>
>> We definitely need Tx for our use-cases! I'll rephrase: the
>> idea was to make the initial patch set without Tx *driver*-specific
>> code, e.g. adopting ndo_xdp_xmit/flush at a later point.
>>
>> So AF_ZEROCOPY, the socket parts, would have Tx support.
>>
>> @John Did I recall that correctly?
>>
>
> Yep, that is what I said. However, on second thought, without the
> driver tx half I guess tx will be significantly slower.

The idea was that existing packet rings already send without
copying, so the benefit from device driver changes is not obvious.

I would leave them out for now and evaluate before possibly
sending a separate patchset.


Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-13 Thread John Fastabend
On 11/13/2017 09:33 PM, Björn Töpel wrote:
> 2017-11-14 0:50 GMT+01:00 Alexei Starovoitov :
>> On 11/13/17 9:07 PM, Björn Töpel wrote:
>>>
>>> 2017-10-31 13:41 GMT+01:00 Björn Töpel :

 From: Björn Töpel 

>>> [...]


 We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
 Korea, and our paper with complete benchmarks will be released shortly
 on the NetDev 2.2 site.

>>>
>>> We're back in the saddle after an excellent netdevconf week. Kudos to
>>> the organizers; We had a blast! Thanks for all the constructive
>>> feedback.
>>>
>>> I'll summarize the major points that we'll address in the next RFC
>>> below.
>>>
>>> * Instead of extending AF_PACKET with yet another version, introduce a
>>>   new address/packet family. As for naming, we had some suggestions:
>>>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>>>   AF_ZEROCOPY, unless there are strong opinions against it.
>>>
>>> * No explicit zerocopy enablement. Use the zerocopy path if
>>>   supported; if not, fall back to the skb path, for netdevs that
>>>   don't support the required ndos. Further, we'll have the zerocopy
>>>   behavior for the skb path as well, meaning that an AF_ZEROCOPY
>>>   socket will consume the skb. We'll honor skb->queue_mapping, so
>>>   that we only consume the packets for the enabled queue.
>>>
>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>   in a separate patchset.
>>
>>
>> all sounds good to me except above bit.
>> I don't remember people suggesting to split it this way.
>> What's the value of it without tx?
>>
> 
> We definitely need Tx for our use-cases! I'll rephrase: the
> idea was to make the initial patch set without Tx *driver*-specific
> code, e.g. adopting ndo_xdp_xmit/flush at a later point.
> 
> So AF_ZEROCOPY, the socket parts, would have Tx support.
> 
> @John Did I recall that correctly?
> 

Yep, that is what I said. However, on second thought, without the
driver tx half I guess tx will be significantly slower. So in order
to get the driver API correct in the first go-around, let's implement
this in the first series as well.

Just try to minimize the TX driver work as much as possible.

Thanks,
John



Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-13 Thread Björn Töpel
2017-11-14 0:50 GMT+01:00 Alexei Starovoitov :
> On 11/13/17 9:07 PM, Björn Töpel wrote:
>>
>> 2017-10-31 13:41 GMT+01:00 Björn Töpel :
>>>
>>> From: Björn Töpel 
>>>
>> [...]
>>>
>>>
>>> We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
>>> Korea, and our paper with complete benchmarks will be released shortly
>>> on the NetDev 2.2 site.
>>>
>>
>> We're back in the saddle after an excellent netdevconf week. Kudos to
>> the organizers; We had a blast! Thanks for all the constructive
>> feedback.
>>
>> I'll summarize the major points that we'll address in the next RFC
>> below.
>>
>> * Instead of extending AF_PACKET with yet another version, introduce a
>>   new address/packet family. As for naming, we had some suggestions:
>>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>>   AF_ZEROCOPY, unless there are strong opinions against it.
>>
>> * No explicit zerocopy enablement. Use the zerocopy path if
>>   supported; if not, fall back to the skb path, for netdevs that
>>   don't support the required ndos. Further, we'll have the zerocopy
>>   behavior for the skb path as well, meaning that an AF_ZEROCOPY
>>   socket will consume the skb. We'll honor skb->queue_mapping, so
>>   that we only consume the packets for the enabled queue.
>>
>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>   in a separate patchset.
>
>
> all sounds good to me except above bit.
> I don't remember people suggesting to split it this way.
> What's the value of it without tx?
>

We definitely need Tx for our use-cases! I'll rephrase: the
idea was to make the initial patch set without Tx *driver*-specific
code, e.g. adopting ndo_xdp_xmit/flush at a later point.

So AF_ZEROCOPY, the socket parts, would have Tx support.

@John Did I recall that correctly?

>> * Minimize the size of the i40e zerocopy patches, by moving the driver
>>   specific code to separate patches.
>>
>> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
>>   XDP redirect map call with ingress flag.
>>
>> * Extend the XDP redirect to support explicit allocator/destructor
>>   functions. Right now, XDP redirect assumes that the page allocator
>>   was used, and the XDP redirect cleanup path is decreasing the page
>>   count of the XDP buffer. This assumption breaks for the zerocopy
>>   case.
>>
>>
>> Björn
>>
>>
>>> We based this patch set on net-next commit e1ea2f9856b7 ("Merge
>>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
>>>
>>> Please focus your review on:
>>>
>>> * The V4 user space interface
>>> * PACKET_ZEROCOPY and its semantics
>>> * Packet array interface
>>> * XDP semantics when executing in zero-copy mode (user space passed
>>>   buffers)
>>> * XDP_PASS_TO_KERNEL semantics
>>>
>>> To do:
>>>
>>> * Investigate the user-space ring structure’s performance problems
>>> * Continue the XDP integration into packet arrays
>>> * Optimize performance
>>> * SKB <-> V4 conversions in tp4a_populate & tp4a_flush
>>> * Packet buffer is unnecessarily pinned for virtual devices
>>> * Support shared packet buffers
>>> * Unify V4 and SKB receive path in I40E driver
>>> * Support for packets spanning multiple frames
>>> * Disassociate the packet array implementation from the V4 queue
>>>   structure
>>>
>>> We would really like to thank the reviewers of the limited
>>> distribution RFC for all their comments that have helped improve the
>>> interfaces and the code significantly: Alexei Starovoitov, Alexander
>>> Duyck, Jesper Dangaard Brouer, and John Fastabend. The internal team
>>> at Intel that has been helping out reviewing code, writing tests, and
>>> sanity checking our ideas: Rami Rosen, Jeff Shaw, Ferruh Yigit, and Qi
>>> Zhang, your participation has really helped.
>>>
>>> Thanks: Björn and Magnus
>>>
>>> [1]
>>> https://www.netdevconf.org/2.2/
>>>
>>>
>>> Björn Töpel (7):
>>>   packet: introduce AF_PACKET V4 userspace API
>>>   packet: implement PACKET_MEMREG setsockopt
>>>   packet: enable AF_PACKET V4 rings
>>>   packet: wire up zerocopy for AF_PACKET V4
>>>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support
>>>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support
>>>   samples/tpacket4: added tpbench
>>>
>>> Magnus Karlsson (7):
>>>   packet: enable Rx for AF_PACKET V4
>>>   packet: enable Tx support for AF_PACKET V4
>>>   netdevice: add AF_PACKET V4 zerocopy ops
>>>   veth: added support for PACKET_ZEROCOPY
>>>   samples/tpacket4: added veth support
>>>   i40e: added XDP support for TP4 enabled queue pairs
>>>   xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use
>>>
>>>  drivers/net/ethernet/intel/i40e/i40e.h |3 +
>>>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |9 +
>>>  

Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-13 Thread Alexei Starovoitov

On 11/13/17 9:07 PM, Björn Töpel wrote:

2017-10-31 13:41 GMT+01:00 Björn Töpel :

From: Björn Töpel 


[...]


We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
Korea, and our paper with complete benchmarks will be released shortly
on the NetDev 2.2 site.



We're back in the saddle after an excellent netdevconf week. Kudos to
the organizers; We had a blast! Thanks for all the constructive
feedback.

I'll summarize the major points that we'll address in the next RFC
below.

* Instead of extending AF_PACKET with yet another version, introduce a
  new address/packet family. As for naming, we had some suggestions:
  AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
  AF_ZEROCOPY, unless there are strong opinions against it.

* No explicit zerocopy enablement. Use the zerocopy path if
  supported; if not, fall back to the skb path, for netdevs that
  don't support the required ndos. Further, we'll have the zerocopy
  behavior for the skb path as well, meaning that an AF_ZEROCOPY
  socket will consume the skb. We'll honor skb->queue_mapping, so
  that we only consume the packets for the enabled queue.

* Limit the scope of the first patchset to Rx only, and introduce Tx
  in a separate patchset.


all sounds good to me except above bit.
I don't remember people suggesting to split it this way.
What's the value of it without tx?


* Minimize the size of the i40e zerocopy patches, by moving the driver
  specific code to separate patches.

* Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
  XDP redirect map call with ingress flag.

* Extend the XDP redirect to support explicit allocator/destructor
  functions. Right now, XDP redirect assumes that the page allocator
  was used, and the XDP redirect cleanup path is decreasing the page
  count of the XDP buffer. This assumption breaks for the zerocopy
  case.


Björn



We based this patch set on net-next commit e1ea2f9856b7 ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").

Please focus your review on:

* The V4 user space interface
* PACKET_ZEROCOPY and its semantics
* Packet array interface
* XDP semantics when executing in zero-copy mode (user space passed
  buffers)
* XDP_PASS_TO_KERNEL semantics

To do:

* Investigate the user-space ring structure’s performance problems
* Continue the XDP integration into packet arrays
* Optimize performance
* SKB <-> V4 conversions in tp4a_populate & tp4a_flush
* Packet buffer is unnecessarily pinned for virtual devices
* Support shared packet buffers
* Unify V4 and SKB receive path in I40E driver
* Support for packets spanning multiple frames
* Disassociate the packet array implementation from the V4 queue
  structure

We would really like to thank the reviewers of the limited
distribution RFC for all their comments that have helped improve the
interfaces and the code significantly: Alexei Starovoitov, Alexander
Duyck, Jesper Dangaard Brouer, and John Fastabend. The internal team
at Intel that has been helping out reviewing code, writing tests, and
sanity checking our ideas: Rami Rosen, Jeff Shaw, Ferruh Yigit, and Qi
Zhang, your participation has really helped.

Thanks: Björn and Magnus

[1] 
https://www.netdevconf.org/2.2/

Björn Töpel (7):
  packet: introduce AF_PACKET V4 userspace API
  packet: implement PACKET_MEMREG setsockopt
  packet: enable AF_PACKET V4 rings
  packet: wire up zerocopy for AF_PACKET V4
  i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support
  i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support
  samples/tpacket4: added tpbench

Magnus Karlsson (7):
  packet: enable Rx for AF_PACKET V4
  packet: enable Tx support for AF_PACKET V4
  netdevice: add AF_PACKET V4 zerocopy ops
  veth: added support for PACKET_ZEROCOPY
  samples/tpacket4: added veth support
  i40e: added XDP support for TP4 enabled queue pairs
  xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use

 drivers/net/ethernet/intel/i40e/i40e.h |3 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |9 +
 drivers/net/ethernet/intel/i40e/i40e_main.c|  837 -
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|  582 -
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|   38 +
 drivers/net/veth.c |  174 +++
 include/linux/netdevice.h  |   16 +
 include/linux/tpacket4.h   | 1502 
 include/uapi/linux/bpf.h   |1 +
 include/uapi/linux/if_packet.h |   65 +-
 net/packet/af_packet.c | 1252 +---
 net/packet/internal.h  |9 +
 samples/tpacket4/Makefile  |   12 +
 samples/tpacket4/bench_all.sh  |   28 

Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-13 Thread John Fastabend
On 11/13/2017 05:07 AM, Björn Töpel wrote:
> 2017-10-31 13:41 GMT+01:00 Björn Töpel :
>> From: Björn Töpel 
>>
> [...]
>>
>> We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
>> Korea, and our paper with complete benchmarks will be released shortly
>> on the NetDev 2.2 site.
>>
> 
> We're back in the saddle after an excellent netdevconf week. Kudos to
> the organizers; We had a blast! Thanks for all the constructive
> feedback.
> 
> I'll summarize the major points that we'll address in the next RFC
> below.
> 
> * Instead of extending AF_PACKET with yet another version, introduce a
>   new address/packet family. As for naming, we had some suggestions:
>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>   AF_ZEROCOPY, unless there are strong opinions against it.
> 

Works for me.

> * No explicit zerocopy enablement. Use the zerocopy path if
>   supported; if not, fall back to the skb path, for netdevs that
>   don't support the required ndos. Further, we'll have the zerocopy
>   behavior for the skb path as well, meaning that an AF_ZEROCOPY
>   socket will consume the skb. We'll honor skb->queue_mapping, so
>   that we only consume the packets for the enabled queue.
> 
> * Limit the scope of the first patchset to Rx only, and introduce Tx
>   in a separate patchset.
> 
> * Minimize the size of the i40e zerocopy patches, by moving the driver
>   specific code to separate patches.
> 
> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
>   XDP redirect map call with ingress flag.
> 

Sounds good we will need to add this as a separate patch series though.

> * Extend the XDP redirect to support explicit allocator/destructor
>   functions. Right now, XDP redirect assumes that the page allocator
>   was used, and the XDP redirect cleanup path is decreasing the page
>   count of the XDP buffer. This assumption breaks for the zerocopy
>   case.
> 

Probably sync with Andy and Jesper on this. I think they are both
looking into something similar.

Thanks,
John

> 
> Björn
> 
> 


Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-13 Thread Björn Töpel
2017-10-31 13:41 GMT+01:00 Björn Töpel :
> From: Björn Töpel 
>
[...]
>
> We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
> Korea, and our paper with complete benchmarks will be released shortly
> on the NetDev 2.2 site.
>

We're back in the saddle after an excellent netdevconf week. Kudos to
the organizers; We had a blast! Thanks for all the constructive
feedback.

I'll summarize the major points that we'll address in the next RFC
below.

* Instead of extending AF_PACKET with yet another version, introduce a
  new address/packet family. As for naming, we had some suggestions:
  AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
  AF_ZEROCOPY, unless there are strong opinions against it.

* No explicit zerocopy enablement. Use the zerocopy path if
  supported; if not, fall back to the skb path for netdevs that
  don't support the required ndos. Further, we'll have the zerocopy
  behavior for the skb path as well, meaning that an AF_ZEROCOPY
  socket will consume the skb and honor skb->queue_mapping, so we
  only consume packets for the enabled queue.

* Limit the scope of the first patchset to Rx only, and introduce Tx
  in a separate patchset.

* Minimize the size of the i40e zerocopy patches, by moving the driver
  specific code to separate patches.

* Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
  XDP redirect map call with ingress flag.

* Extend the XDP redirect to support explicit allocator/destructor
  functions. Right now, XDP redirect assumes that the page allocator
  was used, and the XDP redirect cleanup path is decreasing the page
  count of the XDP buffer. This assumption breaks for the zerocopy
  case.


Björn


> We based this patch set on net-next commit e1ea2f9856b7 ("Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
>
> Please focus your review on:
>
> * The V4 user space interface
> * PACKET_ZEROCOPY and its semantics
> * Packet array interface
> * XDP semantics when executing in zero-copy mode (user space passed
>   buffers)
> * XDP_PASS_TO_KERNEL semantics
>
> To do:
>
> * Investigate the user-space ring structure’s performance problems
> * Continue the XDP integration into packet arrays
> * Optimize performance
> * SKB <-> V4 conversions in tp4a_populate & tp4a_flush
> * Packet buffer is unnecessarily pinned for virtual devices
> * Support shared packet buffers
> * Unify V4 and SKB receive path in I40E driver
> * Support for packets spanning multiple frames
> * Disassociate the packet array implementation from the V4 queue
>   structure
>
> We would really like to thank the reviewers of the limited
> distribution RFC for all their comments that have helped improve the
> interfaces and the code significantly: Alexei Starovoitov, Alexander
> Duyck, Jesper Dangaard Brouer, and John Fastabend. The internal team
> at Intel that has been helping out reviewing code, writing tests, and
> sanity checking our ideas: Rami Rosen, Jeff Shaw, Ferruh Yigit, and Qi
> Zhang, your participation has really helped.
>
> Thanks: Björn and Magnus
>
> [1] https://www.netdevconf.org/2.2/
>
> Björn Töpel (7):
>   packet: introduce AF_PACKET V4 userspace API
>   packet: implement PACKET_MEMREG setsockopt
>   packet: enable AF_PACKET V4 rings
>   packet: wire up zerocopy for AF_PACKET V4
>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support
>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support
>   samples/tpacket4: added tpbench
>
> Magnus Karlsson (7):
>   packet: enable Rx for AF_PACKET V4
>   packet: enable Tx support for AF_PACKET V4
>   netdevice: add AF_PACKET V4 zerocopy ops
>   veth: added support for PACKET_ZEROCOPY
>   samples/tpacket4: added veth support
>   i40e: added XDP support for TP4 enabled queue pairs
>   xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use
>
>  drivers/net/ethernet/intel/i40e/i40e.h |3 +
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |9 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c|  837 -
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c|  582 -
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h|   38 +
>  drivers/net/veth.c |  174 +++
>  include/linux/netdevice.h  |   16 +
>  include/linux/tpacket4.h   | 1502 
>  include/uapi/linux/bpf.h   |1 +
>  include/uapi/linux/if_packet.h |   65 +-
>  net/packet/af_packet.c | 1252 +---
>  net/packet/internal.h  |9 +
>  samples/tpacket4/Makefile  |   12 +
>  samples/tpacket4/bench_all.sh  |   28 +
>  samples/tpacket4/tpbench.c | 1390 ++
>  15 files changed, 5674 insertions(+), 244 deletions(-)
>  create mode 100644 include/linux/tpacket4.h
>  create mode 100644 

Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-03 Thread Willem de Bruijn
On Fri, Nov 3, 2017 at 7:13 PM, Karlsson, Magnus
<magnus.karls...@intel.com> wrote:
>
>
>> -Original Message-
>> From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com]
>> Sent: Friday, November 3, 2017 5:35 AM
>> To: Björn Töpel <bjorn.to...@gmail.com>
>> Cc: Karlsson, Magnus <magnus.karls...@intel.com>; Duyck, Alexander H
>> <alexander.h.du...@intel.com>; Alexander Duyck
>> <alexander.du...@gmail.com>; John Fastabend
>> <john.fastab...@gmail.com>; Alexei Starovoitov <a...@fb.com>; Jesper
>> Dangaard Brouer <bro...@redhat.com>; michael.lundkv...@ericsson.com;
>> ravineet.si...@ericsson.com; Daniel Borkmann <dan...@iogearbox.net>;
>> Network Development <netdev@vger.kernel.org>; Topel, Bjorn
>> <bjorn.to...@intel.com>; Brandeburg, Jesse
>> <jesse.brandeb...@intel.com>; Singhai, Anjali <anjali.sing...@intel.com>;
>> Rosen, Rami <rami.ro...@intel.com>; Shaw, Jeffrey B
>> <jeffrey.b.s...@intel.com>; Yigit, Ferruh <ferruh.yi...@intel.com>; Zhang,
>> Qi Z <qi.z.zh...@intel.com>
>> Subject: Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
>>
>> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.to...@gmail.com>
>> wrote:
>> > From: Björn Töpel <bjorn.to...@intel.com>
>> >
>> > This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
>> > optimized for high performance packet processing and zero-copy
>> > semantics. Throughput improvements can be up to 40x compared to V2
>> and
>> > V3 for the micro benchmarks included. Would be great to get your
>> > feedback on it.
>> >
>> > The main difference between V4 and V2/V3 is that TX and RX descriptors
>> > are separated from packet buffers.
>>
>> Cool feature. I'm looking forward to the netdev talk. Aside from the inline
>> comments in the patches, a few architecture questions.
>
> Glad to hear. Are you going to Netdev in Seoul? If so, let us hook up
> and discuss your comments in further detail. Some initial thoughts
> below.

Sounds great. I'll be there.

>> Is TX support needed? Existing PACKET_TX_RING already sends out packets
>> without copying directly from the tx_ring. Indirection through a descriptor
>> ring is not helpful on TX if all packets still have to come from a 
>> pre-registered
>> packet pool. The patch set adds a lot of tx-only code and is complex enough
>> without it.
>
> That is correct, but what if the packet you are going to transmit came
> in from the receive path and is already in the packet buffer?

Oh, yes, of course. That is a common use case. I should have
thought of that.

> This
> might happen if the application is examining/sniffing packets then
> sending them out, or doing some modification to them. In that case we
> avoid a copy in V4 since the packet is already in the packet
> buffer. With V2 and V3, a copy from the RX ring to the TX ring would
> be needed. In the PACKET_ZEROCOPY case, avoiding this copy increases
> performance quite a lot.
>
>> Can you use the existing PACKET_V2 format for the packet pool? The
>> v4 format is nearly the same as V2. Using the same version might avoid some
>> code duplication and simplify upgrading existing legacy code.
>> Instead of continuing to add new versions whose behavior is implicit,
>> perhaps we can add explicit mode PACKET_INDIRECT to PACKET_V2.
>
> Interesting idea that I think is worth thinking more about. One
> problem though with the V2 ring format model, and the current V4
> format too by the way, when applied to a user-space allocated memory
> is that they are symmetric, i.e. that user space and kernel space have
> to produce and consume the same amount of entries (within the length
> of the descriptor area). User space sends down a buffer entry that the
> kernel fills in for RX for example. Symmetric queues do not work when
> you have a shared packet buffer between two processes. (This is not a
> requirement, but someone might do a mmap with MAP_SHARED for the
> packet buffer and then fork off a child that then inherits this packet
> buffer.) One of the processes might just receive packets, while the
> other one is transmitting. Or if you have a veth link pair between two
> processes and they have been set up to share packet buffer area. With
> a symmetric queue you have to copy even if they share the same packet
> buffer, but with an asymmetric queue, you do not and the driver only
> needs to copy the packet buffer id between the TX desc ring of the
> sender to the RX desc ring of the receiver, not the data. I think this
> gives an indication that we need a new structure.

RE: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-03 Thread Karlsson, Magnus


> -Original Message-
> From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com]
> Sent: Friday, November 3, 2017 5:35 AM
> To: Björn Töpel <bjorn.to...@gmail.com>
> Subject: Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
> 
> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.to...@gmail.com>
> wrote:
> > From: Björn Töpel <bjorn.to...@intel.com>
> >
> > This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
> > optimized for high performance packet processing and zero-copy
> > semantics. Throughput improvements can be up to 40x compared to V2
> and
> > V3 for the micro benchmarks included. Would be great to get your
> > feedback on it.
> >
> > The main difference between V4 and V2/V3 is that TX and RX descriptors
> > are separated from packet buffers.
> 
> Cool feature. I'm looking forward to the netdev talk. Aside from the inline
> comments in the patches, a few architecture questions.

Glad to hear. Are you going to Netdev in Seoul? If so, let us hook up
and discuss your comments in further detail. Some initial thoughts
below.

> Is TX support needed? Existing PACKET_TX_RING already sends out packets
> without copying directly from the tx_ring. Indirection through a descriptor
> ring is not helpful on TX if all packets still have to come from a 
> pre-registered
> packet pool. The patch set adds a lot of tx-only code and is complex enough
> without it.

That is correct, but what if the packet you are going to transmit came
in from the receive path and is already in the packet buffer? This
might happen if the application is examining/sniffing packets then
sending them out, or doing some modification to them. In that case we
avoid a copy in V4 since the packet is already in the packet
buffer. With V2 and V3, a copy from the RX ring to the TX ring would
be needed. In the PACKET_ZEROCOPY case, avoiding this copy increases
performance quite a lot.

> Can you use the existing PACKET_V2 format for the packet pool? The
> v4 format is nearly the same as V2. Using the same version might avoid some
> code duplication and simplify upgrading existing legacy code.
> Instead of continuing to add new versions whose behavior is implicit,
> perhaps we can add explicit mode PACKET_INDIRECT to PACKET_V2.

Interesting idea that I think is worth thinking more about. One
problem though with the V2 ring format model, and the current V4
format too by the way, when applied to a user-space allocated memory
is that they are symmetric, i.e. that user space and kernel space have
to produce and consume the same amount of entries (within the length
of the descriptor area). User space sends down a buffer entry that the
kernel fills in for RX for example. Symmetric queues do not work when
you have a shared packet buffer between two processes. (This is not a
requirement, but someone might do a mmap with MAP_SHARED for the
packet buffer and then fork off a child that then inherits this packet
buffer.) One of the processes might just receive packets, while the
other one is transmitting. Or if you have a veth link pair between two
processes and they have been set up to share packet buffer area. With
a symmetric queue you have to copy even if they share the same packet
buffer, but with an asymmetric queue, you do not and the driver only
needs to copy the packet buffer id between the TX desc ring of the
sender to the RX desc ring of the receiver, not the data. I think this
gives an indication that we need a new structure. Anyway, I like your
idea and it is worth exploring further. Let us have a discussion
about this at Netdev, if you are there.

> Finally, is it necessary to define a new descriptor ring format? Same for the
> packet array and frame set. The kernel already has a few, such as virtio for
> the first, skb_array/ptr_ring, even linux list for the second. These 
> containers
> add a lot of new boilerplate code. If new formats are absolutely necessary, at
> least we should consider making them generic (like skb_array and ptr_ring).

Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-02 Thread Willem de Bruijn
On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel  wrote:
> From: Björn Töpel 
>
> This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
> optimized for high performance packet processing and zero-copy
> semantics. Throughput improvements can be up to 40x compared to V2 and
> V3 for the micro benchmarks included. Would be great to get your
> feedback on it.
>
> The main difference between V4 and V2/V3 is that TX and RX descriptors
> are separated from packet buffers.

Cool feature. I'm looking forward to the netdev talk. Aside from the
inline comments in the patches, a few architecture questions.

Is TX support needed? Existing PACKET_TX_RING already sends out
packets without copying directly from the tx_ring. Indirection through a
descriptor ring is not helpful on TX if all packets still have to come from
a pre-registered packet pool. The patch set adds a lot of tx-only code
and is complex enough without it.

Can you use the existing PACKET_V2 format for the packet pool? The
v4 format is nearly the same as V2. Using the same version might avoid
some code duplication and simplify upgrading existing legacy code.
Instead of continuing to add new versions whose behavior is implicit,
perhaps we can add explicit mode PACKET_INDIRECT to PACKET_V2.

Finally, is it necessary to define a new descriptor ring format? Same for the
packet array and frame set. The kernel already has a few, such as virtio for
the first, skb_array/ptr_ring, even linux list for the second. These containers
add a lot of new boilerplate code. If new formats are absolutely necessary,
at least we should consider making them generic (like skb_array and
ptr_ring). But I'd like to understand first why, e.g., virtio cannot be used.


[RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-10-31 Thread Björn Töpel
From: Björn Töpel 

This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
optimized for high performance packet processing and zero-copy
semantics. Throughput improvements can be up to 40x compared to V2 and
V3 for the micro benchmarks included. Would be great to get your
feedback on it.

The main difference between V4 and V2/V3 is that TX and RX descriptors
are separated from packet buffers. An RX or TX descriptor points to a
data buffer in a packet buffer area. RX and TX can share the same
packet buffer so that a packet does not have to be copied between RX
and TX. Moreover, if a packet needs to be kept for a while due to a
possible retransmit, then the descriptor that points to that packet
buffer can be changed to point to another buffer and reused right
away. This again avoids copying data.

The RX and TX descriptor rings are registered with the setsockopts
PACKET_RX_RING and PACKET_TX_RING, as usual. The packet buffer area is
allocated by user space and registered with the kernel using the new
PACKET_MEMREG setsockopt. All these three areas are shared between
user space and kernel space.

When V4 executes like this, we say that it executes in
"copy-mode". Each packet is sent to the Linux stack and a copy of it
is sent to user space, so V4 behaves in the same way as V2 and V3. All
syscalls operating on file descriptors should just work as if it was
V2 or V3. However, when the new PACKET_ZEROCOPY setsockopt is called,
V4 starts to operate in true zero-copy mode. In this mode, the
networking HW (or SW driver if it is a virtual driver like veth)
DMAs/puts packets straight into the packet buffer that is shared
between user space and kernel space. The RX and TX descriptor queues
of the networking HW are NOT shared with user space. Only the kernel
can read and write these, and it is the kernel driver's responsibility
to translate these HW-specific descriptors to the HW-agnostic ones in
the V4 virtual descriptor rings that user space sees. This way, a
malicious user space program cannot mess with the networking HW.

The PACKET_ZEROCOPY setsockopt acts on a queue pair (channel in
ethtool speak), so one needs to steer the traffic to the zero-copy
enabled queue pair. Which queue to use, is up to the user.

For an untrusted application, HW packet steering to a specific queue
pair (the one associated with the application) is a requirement, as
the application would otherwise be able to see other user space
processes' packets. If the HW cannot support the required packet
steering, packets need to be DMA'ed into non-user-space-visible kernel
buffers and from there copied out to user space. This RFC only
addresses NIC HW with packet steering capabilities.

PACKET_ZEROCOPY comes with "XDP batteries included", so XDP programs
will be executed for zero-copy enabled queues. We're also suggesting
adding a new XDP action, XDP_PASS_TO_KERNEL, to pass copies to the
kernel stack instead of the V4 user space queue in zero-copy mode.

There's a tpbench benchmarking/test application included. Say that
you'd like your UDP traffic from port 4242 to end up in queue 16, that
we'll enable zero-copy on. Here, we use ethtool for this:

  ethtool -N p3p2 rx-flow-hash udp4 fn
  ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
  action 16

Running the benchmark in zero-copy mode can then be done using:

  tpbench -i p3p2 --rxdrop --zerocopy 17

Note that the --zerocopy command-line argument is one-based, and not
zero-based.

We've run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores which gives a total of 28, but only two cores are used in these
experiments. One for Tx/Rx and one for the user space application. The
memory is DDR4 @ 1067 MT/s and the size of each DIMM is 8192MB and
with 8 of those DIMMs in the system we have 64 GB of total memory. The
compiler used is gcc version 5.4.0 20160609. The NIC is an Intel I40E
40Gbit/s using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
byte packets, generated by commercial packet generator HW that is
generating packets at full 40 Gbit/s line rate.

Benchmark   V2 V3 V4 V4+ZC
rxdrop  0.67   0.73   0.74   33.7
txpush  0.98   0.98   0.91   19.6
l2fwd   0.66   0.71   0.67   15.5

The results are generated using the "bench_all.sh" script.

We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
Korea, and our paper with complete benchmarks will be released shortly
on the NetDev 2.2 site.

We based this patch set on net-next commit e1ea2f9856b7 ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").

Please focus your review on:

* The V4 user space interface
* PACKET_ZEROCOPY and its semantics
* Packet array interface
* XDP semantics when executing in zero-copy mode (user space passed
  buffers)
* XDP_PASS_TO_KERNEL semantics

To do:

* Investigate the user-space ring