Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
On Thu, 16 Nov 2017 08:09:04 +0100 Björn Töpel wrote:

> Ideally, it would be best not having to introduce yet another xmit
> ndo. I believe ndo_xdp_xmit/ndo_xdp_flush would be the best fit, but
> we need to extend it with a destructor callback and potentially some
> kind of DMA trait. Why DMA? For zerocopy, we know the working set of
> packet buffers, so they are DMA mapped up front, whereas ndo_xdp_xmit
> does yet another DMA mapping. Paying for the DMA mapping in the
> fast-path is something we'd like to avoid.

I like your idea of reusing ndo_xdp_xmit/ndo_xdp_flush. At NetConf I
think we agreed that the ndo_xdp_xmit API likely needs to change; see
[1], slide 11. Andy Gospodarek and Michael Chan wanted to look into
the needed API changes (Cc'ed), so let's keep them in the loop.

I also appreciate that you are thinking about avoiding the DMA mapping
at TX. It would be a welcome optimization.

[1] http://people.netfilter.org/hawk/presentations/NetConf2017_Seoul/XDP_devel_update_NetConf2017_Seoul.pdf

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support (AF_XDP or AF_CHANNEL?)
On Tue, 14 Nov 2017 20:01:01 +0100 Björn Töpel wrote:

> 2017-11-14 18:19 GMT+01:00 Jesper Dangaard Brouer:
> >
> > On Mon, 13 Nov 2017 22:07:47 +0900 Björn Töpel wrote:
> >
> >> I'll summarize the major points, that we'll address in the next RFC
> >> below.
> >>
> >> * Instead of extending AF_PACKET with yet another version, introduce a
> >>   new address/packet family. As for naming, we had some suggestions:
> >>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
> >>   AF_ZEROCOPY, unless there are strong opinions against it.
> >
> > I mostly like AF_CHANNEL and AF_XDP. I do know XDP is/has evolved into
> > a kernel-side facility that moves XDP frames/packets _inside_ the
> > kernel.
> >
> > *BUT* I've always imagined that we would create a "channel" to
> > userspace, by using XDP_REDIRECT to choose which frames get redirected
> > into which userspace "channel" (a new channel-map type). Userspace
> > pre-allocates and registers memory/pages exactly like this patchset.
> >
> > [Step 1]: (non-ZC) XDP_REDIRECT needs to copy frame data into userspace
> > memory pages, and update your packet_array etc. (Use map-flush to get
> > RX bulking.)
> >
> > [Step 2]: (ZC) Userspace calls a driver NDO to register pages. The
> > XDP_REDIRECT action happens in the driver, and can have knowledge about
> > the RX ring. It can know whether this RX ring is zero-copy enabled and
> > skip the copy step.
> >
> Jesper, I *really* like this approach -- especially the fact that the
> existing XDP path in the drivers can be reused. I'll spend some time
> dissecting the details of your suggestion.

I'm very happy that you like this approach :-)

> >> * No explicit zerocopy enablement. Use the zerocopy path if
> >>   supported; if not, fall back to the skb path for netdevs that
> >>   don't support the required ndos.
> >
> > When the driver does not support the NDO in the above model, I think
> > there will still be a significant performance boost for the non-ZC
> > variant. Even though we need a copy operation, there are no memory
> > allocations, as userspace has preallocated and registered pages with
> > the kernel (and mem-limits are implicit via the mem-size registered by
> > userspace).
> >
> Yup, and we're not paying for the whole skb creation, given that we
> execute from XDP_DRV and not XDP_SKB.

Yes, exactly. Avoiding the SKB allocation for non-ZC mode will be a
significant saving. As your benchmarks showed, the AF_PACKET V4
approach for non-ZC mode does not give you/us any real performance
improvement. This approach would.

> >> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL; instead use
> >>   an XDP redirect map call with an ingress flag.
> >
> > In the above model, XDP_REDIRECT is used for filtering into a userspace
> > "channel". If ZC gets enabled on an RX-ring queue, then XDP_PASS has to
> > do a copy (RX-ring knowledge is available), like you describe with
> > XDP_PASS_TO_KERNEL.
> >
> Again, this fits nicely in.

> >> * Extend the XDP redirect to support explicit allocator/destructor
> >>   functions. Right now, XDP redirect assumes that the page allocator
> >>   was used, and the XDP redirect cleanup path is decreasing the page
> >>   count of the XDP buffer. This assumption breaks for the zerocopy
> >>   case.
> >
> > Yes, please. If XDP_REDIRECT gets a destructor callback, then we can
> > allow XDP_REDIRECT out another net_device, even when ZC is enabled on
> > an RX-ring queue.

I will (of course) be eager to test and benchmark this approach, as I
have high hopes of a performance boost even for non-ZC. I know an
AF_XDP approach is a lot of work, but I would like to offer to help
out in any way I can.

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
2017-11-16 4:35 GMT+01:00 Willem de Bruijn:
> On Wed, Nov 15, 2017 at 9:55 PM, Alexei Starovoitov wrote:
>> On 11/14/17 4:20 AM, Willem de Bruijn wrote:
>>>>>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>>>>>   in a separate patchset.
>>>>>>
>>>>>> all sounds good to me except above bit.
>>>>>> I don't remember people suggesting to split it this way.
>>>>>> What's the value of it without tx?
>>>>>>
>>>>> We definitely need Tx for our use-cases! I'll rephrase: the
>>>>> idea was making the initial patch set without Tx *driver*
>>>>> specific code, e.g. use ndo_xdp_xmit/flush at a later point.
>>>>>
>>>>> So AF_ZEROCOPY, the socket parts, would have Tx support.
>>>>>
>>>>> @John Did I recall that correctly?
>>>>>
>>>> Yep, that is what I said. However, on second thought, without the
>>>> driver tx half I guess tx will be significantly slower.
>>>
>>> The idea was that existing packet rings already send without
>>> copying, so the benefit from device driver changes is not obvious.
>>>
>>> I would leave them out for now and evaluate before possibly
>>> sending a separate patchset.
>>
>> are you suggesting to use new af_zerocopy for rx and old
>> af_packet for tx? imo that's too cumbersome to use.
>> New interface has to be symmetrical from the start.
>
> No, that tx can be implemented without device driver
> changes. At least initially.
>
> Unlike rx, tx does not need driver support to implement
> copy avoidance, as pf_packet tx_ring already has this.
>
> Having to go through ndo_start_xmit does introduce other
> overhead, notably skb alloc. Perhaps ndo_xdp_xmit is a
> better choice (but I'm not very familiar with that).
>
> If some cost is inherent to a device-independent solution
> and needs driver support to avoid it, then that can be added
> in a follow-on patchset. But this one is large already without
> the i40e tx patch.

Ideally, it would be best not having to introduce yet another xmit
ndo. I believe ndo_xdp_xmit/ndo_xdp_flush would be the best fit, but
we need to extend it with a destructor callback and potentially some
kind of DMA trait. Why DMA? For zerocopy, we know the working set of
packet buffers, so they are DMA mapped up front, whereas ndo_xdp_xmit
does yet another DMA mapping. Paying for the DMA mapping in the
fast-path is something we'd like to avoid.
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
On Wed, Nov 15, 2017 at 9:55 PM, Alexei Starovoitov wrote:
> On 11/14/17 4:20 AM, Willem de Bruijn wrote:
>>>>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>>>>   in a separate patchset.
>>>>>
>>>>> all sounds good to me except above bit.
>>>>> I don't remember people suggesting to split it this way.
>>>>> What's the value of it without tx?
>>>>>
>>>> We definitely need Tx for our use-cases! I'll rephrase: the
>>>> idea was making the initial patch set without Tx *driver*
>>>> specific code, e.g. use ndo_xdp_xmit/flush at a later point.
>>>>
>>>> So AF_ZEROCOPY, the socket parts, would have Tx support.
>>>>
>>>> @John Did I recall that correctly?
>>>>
>>> Yep, that is what I said. However, on second thought, without the
>>> driver tx half I guess tx will be significantly slower.
>>
>> The idea was that existing packet rings already send without
>> copying, so the benefit from device driver changes is not obvious.
>>
>> I would leave them out for now and evaluate before possibly
>> sending a separate patchset.
>
> are you suggesting to use new af_zerocopy for rx and old
> af_packet for tx? imo that's too cumbersome to use.
> New interface has to be symmetrical from the start.

No, that tx can be implemented without device driver
changes. At least initially.

Unlike rx, tx does not need driver support to implement
copy avoidance, as pf_packet tx_ring already has this.

Having to go through ndo_start_xmit does introduce other
overhead, notably skb alloc. Perhaps ndo_xdp_xmit is a
better choice (but I'm not very familiar with that).

If some cost is inherent to a device-independent solution
and needs driver support to avoid it, then that can be added
in a follow-on patchset. But this one is large already without
the i40e tx patch.
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
On 11/14/17 4:20 AM, Willem de Bruijn wrote:
>>>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>>>   in a separate patchset.
>>>>
>>>> all sounds good to me except above bit.
>>>> I don't remember people suggesting to split it this way.
>>>> What's the value of it without tx?
>>>>
>>> We definitely need Tx for our use-cases! I'll rephrase: the
>>> idea was making the initial patch set without Tx *driver*
>>> specific code, e.g. use ndo_xdp_xmit/flush at a later point.
>>>
>>> So AF_ZEROCOPY, the socket parts, would have Tx support.
>>>
>>> @John Did I recall that correctly?
>>>
>> Yep, that is what I said. However, on second thought, without the
>> driver tx half I guess tx will be significantly slower.
>
> The idea was that existing packet rings already send without
> copying, so the benefit from device driver changes is not obvious.
>
> I would leave them out for now and evaluate before possibly
> sending a separate patchset.

are you suggesting to use new af_zerocopy for rx and old
af_packet for tx? imo that's too cumbersome to use.
New interface has to be symmetrical from the start.
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support (AF_XDP or AF_CHANNEL?)
2017-11-14 18:19 GMT+01:00 Jesper Dangaard Brouer:
>
> On Mon, 13 Nov 2017 22:07:47 +0900 Björn Töpel wrote:
>
>> I'll summarize the major points, that we'll address in the next RFC
>> below.
>>
>> * Instead of extending AF_PACKET with yet another version, introduce a
>>   new address/packet family. As for naming, we had some suggestions:
>>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>>   AF_ZEROCOPY, unless there are strong opinions against it.
>
> I mostly like AF_CHANNEL and AF_XDP. I do know XDP is/has evolved into
> a kernel-side facility that moves XDP frames/packets _inside_ the
> kernel.
>
> *BUT* I've always imagined that we would create a "channel" to
> userspace, by using XDP_REDIRECT to choose which frames get redirected
> into which userspace "channel" (a new channel-map type). Userspace
> pre-allocates and registers memory/pages exactly like this patchset.
>
> [Step 1]: (non-ZC) XDP_REDIRECT needs to copy frame data into userspace
> memory pages, and update your packet_array etc. (Use map-flush to get
> RX bulking.)
>
> [Step 2]: (ZC) Userspace calls a driver NDO to register pages. The
> XDP_REDIRECT action happens in the driver, and can have knowledge about
> the RX ring. It can know whether this RX ring is zero-copy enabled and
> skip the copy step.
>

Jesper, I *really* like this approach -- especially the fact that the
existing XDP path in the drivers can be reused. I'll spend some time
dissecting the details of your suggestion.

>> * No explicit zerocopy enablement. Use the zerocopy path if
>>   supported; if not, fall back to the skb path for netdevs that
>>   don't support the required ndos.
>
> When the driver does not support the NDO in the above model, I think
> there will still be a significant performance boost for the non-ZC
> variant. Even though we need a copy operation, there are no memory
> allocations, as userspace has preallocated and registered pages with
> the kernel (and mem-limits are implicit via the mem-size registered by
> userspace).
>

Yup, and we're not paying for the whole skb creation, given that we
execute from XDP_DRV and not XDP_SKB.

>> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL; instead use
>>   an XDP redirect map call with an ingress flag.
>
> In the above model, XDP_REDIRECT is used for filtering into a userspace
> "channel". If ZC gets enabled on an RX-ring queue, then XDP_PASS has to
> do a copy (RX-ring knowledge is available), like you describe with
> XDP_PASS_TO_KERNEL.
>

Again, this fits nicely in.

>> * Extend the XDP redirect to support explicit allocator/destructor
>>   functions. Right now, XDP redirect assumes that the page allocator
>>   was used, and the XDP redirect cleanup path is decreasing the page
>>   count of the XDP buffer. This assumption breaks for the zerocopy
>>   case.
>
> Yes, please. If XDP_REDIRECT gets a destructor callback, then we can
> allow XDP_REDIRECT out another net_device, even when ZC is enabled on
> an RX-ring queue.
>
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support (AF_XDP or AF_CHANNEL?)
On Mon, 13 Nov 2017 22:07:47 +0900 Björn Töpel wrote:

> I'll summarize the major points, that we'll address in the next RFC
> below.
>
> * Instead of extending AF_PACKET with yet another version, introduce a
>   new address/packet family. As for naming, we had some suggestions:
>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>   AF_ZEROCOPY, unless there are strong opinions against it.

I mostly like AF_CHANNEL and AF_XDP. I do know XDP is/has evolved into
a kernel-side facility that moves XDP frames/packets _inside_ the
kernel.

*BUT* I've always imagined that we would create a "channel" to
userspace, by using XDP_REDIRECT to choose which frames get redirected
into which userspace "channel" (a new channel-map type). Userspace
pre-allocates and registers memory/pages exactly like this patchset.

[Step 1]: (non-ZC) XDP_REDIRECT needs to copy frame data into userspace
memory pages, and update your packet_array etc. (Use map-flush to get
RX bulking.)

[Step 2]: (ZC) Userspace calls a driver NDO to register pages. The
XDP_REDIRECT action happens in the driver, and can have knowledge about
the RX ring. It can know whether this RX ring is zero-copy enabled and
skip the copy step.

> * No explicit zerocopy enablement. Use the zerocopy path if
>   supported; if not, fall back to the skb path for netdevs that
>   don't support the required ndos.

When the driver does not support the NDO in the above model, I think
there will still be a significant performance boost for the non-ZC
variant. Even though we need a copy operation, there are no memory
allocations, as userspace has preallocated and registered pages with
the kernel (and mem-limits are implicit via the mem-size registered by
userspace).

> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL; instead use
>   an XDP redirect map call with an ingress flag.

In the above model, XDP_REDIRECT is used for filtering into a userspace
"channel". If ZC gets enabled on an RX-ring queue, then XDP_PASS has to
do a copy (RX-ring knowledge is available), like you describe with
XDP_PASS_TO_KERNEL.

> * Extend the XDP redirect to support explicit allocator/destructor
>   functions. Right now, XDP redirect assumes that the page allocator
>   was used, and the XDP redirect cleanup path is decreasing the page
>   count of the XDP buffer. This assumption breaks for the zerocopy
>   case.

Yes, please. If XDP_REDIRECT gets a destructor callback, then we can
allow XDP_REDIRECT out another net_device, even when ZC is enabled on
an RX-ring queue.

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
>>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>>   in a separate patchset.
>>>
>>> all sounds good to me except above bit.
>>> I don't remember people suggesting to split it this way.
>>> What's the value of it without tx?
>>>
>> We definitely need Tx for our use-cases! I'll rephrase: the
>> idea was making the initial patch set without Tx *driver*
>> specific code, e.g. use ndo_xdp_xmit/flush at a later point.
>>
>> So AF_ZEROCOPY, the socket parts, would have Tx support.
>>
>> @John Did I recall that correctly?
>>
> Yep, that is what I said. However, on second thought, without the
> driver tx half I guess tx will be significantly slower.

The idea was that existing packet rings already send without
copying, so the benefit from device driver changes is not obvious.

I would leave them out for now and evaluate before possibly
sending a separate patchset.
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
On 11/13/2017 09:33 PM, Björn Töpel wrote:
> 2017-11-14 0:50 GMT+01:00 Alexei Starovoitov:
>> On 11/13/17 9:07 PM, Björn Töpel wrote:
>>>
>>> 2017-10-31 13:41 GMT+01:00 Björn Töpel:
>>>> From: Björn Töpel
>>>>
>>> [...]
>>>>
>>>> We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
>>>> Korea, and our paper with complete benchmarks will be released
>>>> shortly on the NetDev 2.2 site.
>>>>
>>> We're back in the saddle after an excellent netdevconf week. Kudos to
>>> the organizers; we had a blast! Thanks for all the constructive
>>> feedback.
>>>
>>> I'll summarize the major points, that we'll address in the next RFC
>>> below.
>>>
>>> * Instead of extending AF_PACKET with yet another version, introduce a
>>>   new address/packet family. As for naming, we had some suggestions:
>>>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>>>   AF_ZEROCOPY, unless there are strong opinions against it.
>>>
>>> * No explicit zerocopy enablement. Use the zerocopy path if
>>>   supported; if not, fall back to the skb path, for netdevs that
>>>   don't support the required ndos. Further, we'll have the zerocopy
>>>   behavior for the skb path as well, meaning that an AF_ZEROCOPY
>>>   socket will consume the skb and we'll honor skb->queue_mapping,
>>>   meaning that we only consume the packets for the enabled queue.
>>>
>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>   in a separate patchset.
>>
>> all sounds good to me except above bit.
>> I don't remember people suggesting to split it this way.
>> What's the value of it without tx?
>>
> We definitely need Tx for our use-cases! I'll rephrase: the
> idea was making the initial patch set without Tx *driver*
> specific code, e.g. use ndo_xdp_xmit/flush at a later point.
>
> So AF_ZEROCOPY, the socket parts, would have Tx support.
>
> @John Did I recall that correctly?
>

Yep, that is what I said. However, on second thought, without the
driver tx half I guess tx will be significantly slower. So in order to
get the driver API correct in the first go-around, let's implement
this in the first series as well. Just try to minimize the Tx driver
work as much as possible.

Thanks,
John
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
2017-11-14 0:50 GMT+01:00 Alexei Starovoitov:
> On 11/13/17 9:07 PM, Björn Töpel wrote:
>>
>> 2017-10-31 13:41 GMT+01:00 Björn Töpel:
>>>
>>> From: Björn Töpel
>>>
>> [...]
>>>
>>> We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
>>> Korea, and our paper with complete benchmarks will be released
>>> shortly on the NetDev 2.2 site.
>>>
>> We're back in the saddle after an excellent netdevconf week. Kudos to
>> the organizers; we had a blast! Thanks for all the constructive
>> feedback.
>>
>> I'll summarize the major points, that we'll address in the next RFC
>> below.
>>
>> * Instead of extending AF_PACKET with yet another version, introduce a
>>   new address/packet family. As for naming, we had some suggestions:
>>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>>   AF_ZEROCOPY, unless there are strong opinions against it.
>>
>> * No explicit zerocopy enablement. Use the zerocopy path if
>>   supported; if not, fall back to the skb path, for netdevs that
>>   don't support the required ndos. Further, we'll have the zerocopy
>>   behavior for the skb path as well, meaning that an AF_ZEROCOPY
>>   socket will consume the skb and we'll honor skb->queue_mapping,
>>   meaning that we only consume the packets for the enabled queue.
>>
>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>   in a separate patchset.
>
> all sounds good to me except above bit.
> I don't remember people suggesting to split it this way.
> What's the value of it without tx?
>

We definitely need Tx for our use-cases! I'll rephrase: the
idea was making the initial patch set without Tx *driver*
specific code, e.g. use ndo_xdp_xmit/flush at a later point.

So AF_ZEROCOPY, the socket parts, would have Tx support.

@John Did I recall that correctly?

>> * Minimize the size of the i40e zerocopy patches, by moving the driver
>>   specific code to separate patches.
>>
>> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL; instead use
>>   an XDP redirect map call with an ingress flag.
>>
>> * Extend the XDP redirect to support explicit allocator/destructor
>>   functions. Right now, XDP redirect assumes that the page allocator
>>   was used, and the XDP redirect cleanup path is decreasing the page
>>   count of the XDP buffer. This assumption breaks for the zerocopy
>>   case.
>>
>> Björn
>>
>>> We based this patch set on net-next commit e1ea2f9856b7 ("Merge
>>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
>>>
>>> Please focus your review on:
>>>
>>> * The V4 user space interface
>>> * PACKET_ZEROCOPY and its semantics
>>> * Packet array interface
>>> * XDP semantics when executing in zero-copy mode (user space passed
>>>   buffers)
>>> * XDP_PASS_TO_KERNEL semantics
>>>
>>> To do:
>>>
>>> * Investigate the user-space ring structure's performance problems
>>> * Continue the XDP integration into packet arrays
>>> * Optimize performance
>>> * SKB <-> V4 conversions in tp4a_populate & tp4a_flush
>>> * Packet buffer is unnecessarily pinned for virtual devices
>>> * Support shared packet buffers
>>> * Unify V4 and SKB receive path in I40E driver
>>> * Support for packets spanning multiple frames
>>> * Disassociate the packet array implementation from the V4 queue
>>>   structure
>>>
>>> We would really like to thank the reviewers of the limited
>>> distribution RFC for all their comments that have helped improve the
>>> interfaces and the code significantly: Alexei Starovoitov, Alexander
>>> Duyck, Jesper Dangaard Brouer, and John Fastabend. The internal team
>>> at Intel that has been helping out reviewing code, writing tests, and
>>> sanity checking our ideas: Rami Rosen, Jeff Shaw, Ferruh Yigit, and Qi
>>> Zhang, your participation has really helped.
>>>
>>> Thanks: Björn and Magnus
>>>
>>> [1] https://www.netdevconf.org/2.2/
>>>
>>> Björn Töpel (7):
>>>   packet: introduce AF_PACKET V4 userspace API
>>>   packet: implement PACKET_MEMREG setsockopt
>>>   packet: enable AF_PACKET V4 rings
>>>   packet: wire up zerocopy for AF_PACKET V4
>>>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support
>>>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support
>>>   samples/tpacket4: added tpbench
>>>
>>> Magnus Karlsson (7):
>>>   packet: enable Rx for AF_PACKET V4
>>>   packet: enable Tx support for AF_PACKET V4
>>>   netdevice: add AF_PACKET V4 zerocopy ops
>>>   veth: added support for PACKET_ZEROCOPY
>>>   samples/tpacket4: added veth support
>>>   i40e: added XDP support for TP4 enabled queue pairs
>>>   xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use
>>>
>>>  drivers/net/ethernet/intel/i40e/i40e.h         |    3 +
>>>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |    9 +
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
On 11/13/17 9:07 PM, Björn Töpel wrote:
> 2017-10-31 13:41 GMT+01:00 Björn Töpel:
>> From: Björn Töpel
>>
> [...]
>>
>> We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
>> Korea, and our paper with complete benchmarks will be released
>> shortly on the NetDev 2.2 site.
>>
> We're back in the saddle after an excellent netdevconf week. Kudos to
> the organizers; we had a blast! Thanks for all the constructive
> feedback.
>
> I'll summarize the major points, that we'll address in the next RFC
> below.
>
> * Instead of extending AF_PACKET with yet another version, introduce a
>   new address/packet family. As for naming, we had some suggestions:
>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>   AF_ZEROCOPY, unless there are strong opinions against it.
>
> * No explicit zerocopy enablement. Use the zerocopy path if
>   supported; if not, fall back to the skb path, for netdevs that
>   don't support the required ndos. Further, we'll have the zerocopy
>   behavior for the skb path as well, meaning that an AF_ZEROCOPY
>   socket will consume the skb and we'll honor skb->queue_mapping,
>   meaning that we only consume the packets for the enabled queue.
>
> * Limit the scope of the first patchset to Rx only, and introduce Tx
>   in a separate patchset.

all sounds good to me except above bit.
I don't remember people suggesting to split it this way.
What's the value of it without tx?

> * Minimize the size of the i40e zerocopy patches, by moving the driver
>   specific code to separate patches.
>
> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL; instead use
>   an XDP redirect map call with an ingress flag.
>
> * Extend the XDP redirect to support explicit allocator/destructor
>   functions. Right now, XDP redirect assumes that the page allocator
>   was used, and the XDP redirect cleanup path is decreasing the page
>   count of the XDP buffer. This assumption breaks for the zerocopy
>   case.
>
> Björn
>
>> We based this patch set on net-next commit e1ea2f9856b7 ("Merge
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
>>
>> Please focus your review on:
>>
>> * The V4 user space interface
>> * PACKET_ZEROCOPY and its semantics
>> * Packet array interface
>> * XDP semantics when executing in zero-copy mode (user space passed
>>   buffers)
>> * XDP_PASS_TO_KERNEL semantics
>>
>> To do:
>>
>> * Investigate the user-space ring structure's performance problems
>> * Continue the XDP integration into packet arrays
>> * Optimize performance
>> * SKB <-> V4 conversions in tp4a_populate & tp4a_flush
>> * Packet buffer is unnecessarily pinned for virtual devices
>> * Support shared packet buffers
>> * Unify V4 and SKB receive path in I40E driver
>> * Support for packets spanning multiple frames
>> * Disassociate the packet array implementation from the V4 queue
>>   structure
>>
>> We would really like to thank the reviewers of the limited
>> distribution RFC for all their comments that have helped improve the
>> interfaces and the code significantly: Alexei Starovoitov, Alexander
>> Duyck, Jesper Dangaard Brouer, and John Fastabend. The internal team
>> at Intel that has been helping out reviewing code, writing tests, and
>> sanity checking our ideas: Rami Rosen, Jeff Shaw, Ferruh Yigit, and Qi
>> Zhang, your participation has really helped.
>>
>> Thanks: Björn and Magnus
>>
>> [1] https://www.netdevconf.org/2.2/
>>
>> Björn Töpel (7):
>>   packet: introduce AF_PACKET V4 userspace API
>>   packet: implement PACKET_MEMREG setsockopt
>>   packet: enable AF_PACKET V4 rings
>>   packet: wire up zerocopy for AF_PACKET V4
>>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support
>>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support
>>   samples/tpacket4: added tpbench
>>
>> Magnus Karlsson (7):
>>   packet: enable Rx for AF_PACKET V4
>>   packet: enable Tx support for AF_PACKET V4
>>   netdevice: add AF_PACKET V4 zerocopy ops
>>   veth: added support for PACKET_ZEROCOPY
>>   samples/tpacket4: added veth support
>>   i40e: added XDP support for TP4 enabled queue pairs
>>   xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use
>>
>>  drivers/net/ethernet/intel/i40e/i40e.h         |    3 +
>>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |    9 +
>>  drivers/net/ethernet/intel/i40e/i40e_main.c    |  837 -
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  582 -
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   38 +
>>  drivers/net/veth.c                             |  174 +++
>>  include/linux/netdevice.h                      |   16 +
>>  include/linux/tpacket4.h                       | 1502
>>  include/uapi/linux/bpf.h                       |    1 +
>>  include/uapi/linux/if_packet.h                 |   65 +-
>>  net/packet/af_packet.c                         | 1252 +---
>>  net/packet/internal.h                          |    9 +
>>  samples/tpacket4/Makefile                      |   12 +
>>  samples/tpacket4/bench_all.sh                  |   28
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
On 11/13/2017 05:07 AM, Björn Töpel wrote:
> 2017-10-31 13:41 GMT+01:00 Björn Töpel:
>> From: Björn Töpel
>>
> [...]
>>
>> We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
>> Korea, and our paper with complete benchmarks will be released
>> shortly on the NetDev 2.2 site.
>>
> We're back in the saddle after an excellent netdevconf week. Kudos to
> the organizers; we had a blast! Thanks for all the constructive
> feedback.
>
> I'll summarize the major points, that we'll address in the next RFC
> below.
>
> * Instead of extending AF_PACKET with yet another version, introduce a
>   new address/packet family. As for naming, we had some suggestions:
>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>   AF_ZEROCOPY, unless there are strong opinions against it.
>

Works for me.

> * No explicit zerocopy enablement. Use the zerocopy path if
>   supported; if not, fall back to the skb path, for netdevs that
>   don't support the required ndos. Further, we'll have the zerocopy
>   behavior for the skb path as well, meaning that an AF_ZEROCOPY
>   socket will consume the skb and we'll honor skb->queue_mapping,
>   meaning that we only consume the packets for the enabled queue.
>
> * Limit the scope of the first patchset to Rx only, and introduce Tx
>   in a separate patchset.
>
> * Minimize the size of the i40e zerocopy patches, by moving the driver
>   specific code to separate patches.
>
> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL; instead use
>   an XDP redirect map call with an ingress flag.
>

Sounds good; we will need to add this as a separate patch series,
though.

> * Extend the XDP redirect to support explicit allocator/destructor
>   functions. Right now, XDP redirect assumes that the page allocator
>   was used, and the XDP redirect cleanup path is decreasing the page
>   count of the XDP buffer. This assumption breaks for the zerocopy
>   case.
>

Probably sync with Andy and Jesper on this. I think they are both
looking into something similar.

Thanks,
John

> Björn
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
2017-10-31 13:41 GMT+01:00 Björn Töpel: > From: Björn Töpel > [...] > > We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul, > Korea, and our paper with complete benchmarks will be released shortly > on the NetDev 2.2 site. > We're back in the saddle after an excellent netdevconf week. Kudos to the organizers; We had a blast! Thanks for all the constructive feedback. I'll summarize the major points, that we'll address in the next RFC below. * Instead of extending AF_PACKET with yet another version, introduce a new address/packet family. As for naming had some name suggestions: AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for AF_ZEROCOPY, unless there're no strong opinions against it. * No explicit zerocopy enablement. Use the zeropcopy path if supported, if not -- fallback to the skb path, for netdevs that don't support the required ndos. Further, we'll have the zerocopy behavior for the skb path as well, meaning that an AF_ZEROCOPY socket will consume the skb and we'll honor skb->queue_mapping, meaning that we only consume the packets for the enabled queue. * Limit the scope of the first patchset to Rx only, and introduce Tx in a separate patchset. * Minimize the size of the i40e zerocopy patches, by moving the driver specific code to separate patches. * Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use XDP redirect map call with ingress flag. * Extend the XDP redirect to support explicit allocator/destructor functions. Right now, XDP redirect assumes that the page allocator was used, and the XDP redirect cleanup path is decreasing the page count of the XDP buffer. This assumption breaks for the zerocopy case. Björn > We based this patch set on net-next commit e1ea2f9856b7 ("Merge > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net"). 
> Please focus your review on:
>
> * The V4 user space interface
> * PACKET_ZEROCOPY and its semantics
> * Packet array interface
> * XDP semantics when executing in zero-copy mode (user space passed
>   buffers)
> * XDP_PASS_TO_KERNEL semantics
>
> To do:
>
> * Investigate the user-space ring structure's performance problems
> * Continue the XDP integration into packet arrays
> * Optimize performance
> * SKB <-> V4 conversions in tp4a_populate & tp4a_flush
> * Packet buffer is unnecessarily pinned for virtual devices
> * Support shared packet buffers
> * Unify V4 and SKB receive path in I40E driver
> * Support for packets spanning multiple frames
> * Disassociate the packet array implementation from the V4 queue
>   structure
>
> We would really like to thank the reviewers of the limited
> distribution RFC for all their comments that have helped improve the
> interfaces and the code significantly: Alexei Starovoitov, Alexander
> Duyck, Jesper Dangaard Brouer, and John Fastabend. The internal team
> at Intel that has been helping out reviewing code, writing tests, and
> sanity checking our ideas: Rami Rosen, Jeff Shaw, Ferruh Yigit, and
> Qi Zhang; your participation has really helped.
> Thanks: Björn and Magnus
>
> [1] https://www.netdevconf.org/2.2/
>
> Björn Töpel (7):
>   packet: introduce AF_PACKET V4 userspace API
>   packet: implement PACKET_MEMREG setsockopt
>   packet: enable AF_PACKET V4 rings
>   packet: wire up zerocopy for AF_PACKET V4
>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support
>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support
>   samples/tpacket4: added tpbench
>
> Magnus Karlsson (7):
>   packet: enable Rx for AF_PACKET V4
>   packet: enable Tx support for AF_PACKET V4
>   netdevice: add AF_PACKET V4 zerocopy ops
>   veth: added support for PACKET_ZEROCOPY
>   samples/tpacket4: added veth support
>   i40e: added XDP support for TP4 enabled queue pairs
>   xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use
>
>  drivers/net/ethernet/intel/i40e/i40e.h         |    3 +
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |    9 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c    |  837 -
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  582 -
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   38 +
>  drivers/net/veth.c                             |  174 +++
>  include/linux/netdevice.h                      |   16 +
>  include/linux/tpacket4.h                       | 1502
>  include/uapi/linux/bpf.h                       |    1 +
>  include/uapi/linux/if_packet.h                 |   65 +-
>  net/packet/af_packet.c                         | 1252 +---
>  net/packet/internal.h                          |    9 +
>  samples/tpacket4/Makefile                      |   12 +
>  samples/tpacket4/bench_all.sh                  |   28 +
>  samples/tpacket4/tpbench.c                     | 1390 ++
>  15 files changed, 5674 insertions(+), 244 deletions(-)
>  create mode 100644 include/linux/tpacket4.h
>  create mode 100644
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
On Fri, Nov 3, 2017 at 7:13 PM, Karlsson, Magnus
<magnus.karls...@intel.com> wrote:
>
>> -----Original Message-----
>> From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com]
>> Sent: Friday, November 3, 2017 5:35 AM
>> To: Björn Töpel <bjorn.to...@gmail.com>
>> Cc: Karlsson, Magnus <magnus.karls...@intel.com>; Duyck, Alexander H
>> <alexander.h.du...@intel.com>; Alexander Duyck
>> <alexander.du...@gmail.com>; John Fastabend
>> <john.fastab...@gmail.com>; Alexei Starovoitov <a...@fb.com>; Jesper
>> Dangaard Brouer <bro...@redhat.com>; michael.lundkv...@ericsson.com;
>> ravineet.si...@ericsson.com; Daniel Borkmann <dan...@iogearbox.net>;
>> Network Development <netdev@vger.kernel.org>; Topel, Bjorn
>> <bjorn.to...@intel.com>; Brandeburg, Jesse
>> <jesse.brandeb...@intel.com>; Singhai, Anjali
>> <anjali.sing...@intel.com>; Rosen, Rami <rami.ro...@intel.com>;
>> Shaw, Jeffrey B <jeffrey.b.s...@intel.com>; Yigit, Ferruh
>> <ferruh.yi...@intel.com>; Zhang, Qi Z <qi.z.zh...@intel.com>
>> Subject: Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
>>
>> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.to...@gmail.com>
>> wrote:
>> > From: Björn Töpel <bjorn.to...@intel.com>
>> >
>> > This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
>> > optimized for high performance packet processing and zero-copy
>> > semantics. Throughput improvements can be up to 40x compared to V2
>> > and V3 for the micro benchmarks included. Would be great to get
>> > your feedback on it.
>> >
>> > The main difference between V4 and V2/V3 is that TX and RX
>> > descriptors are separated from packet buffers.
>>
>> Cool feature. I'm looking forward to the netdev talk. Aside from the
>> inline comments in the patches, a few architecture questions.
>
> Glad to hear. Are you going to Netdev in Seoul? If so, let us hook up
> and discuss your comments in further detail. Some initial thoughts
> below.

Sounds great. I'll be there.

>> Is TX support needed?
>> Existing PACKET_TX_RING already sends out packets without copying
>> directly from the tx_ring. Indirection through a descriptor ring is
>> not helpful on TX if all packets still have to come from a
>> pre-registered packet pool. The patch set adds a lot of tx-only code
>> and is complex enough without it.
>
> That is correct, but what if the packet you are going to transmit
> came in from the receive path and is already in the packet buffer?

Oh, yes, of course. That is a common use case. I should have thought
of that.

> This might happen if the application is examining/sniffing packets
> then sending them out, or doing some modification to them. In that
> case we avoid a copy in V4 since the packet is already in the packet
> buffer. With V2 and V3, a copy from the RX ring to the TX ring would
> be needed. In the PACKET_ZEROCOPY case, avoiding this copy increases
> performance quite a lot.
>
>> Can you use the existing PACKET_V2 format for the packet pool? The
>> v4 format is nearly the same as V2. Using the same version might
>> avoid some code duplication and simplify upgrading existing legacy
>> code. Instead of continuing to add new versions whose behavior is
>> implicit, perhaps we can add explicit mode PACKET_INDIRECT to
>> PACKET_V2.
>
> Interesting idea that I think is worth thinking more about. One
> problem, though, with the V2 ring format model, and the current V4
> format too by the way, when applied to user-space allocated memory,
> is that they are symmetric, i.e. user space and kernel space have to
> produce and consume the same amount of entries (within the length of
> the descriptor area). User space sends down a buffer entry that the
> kernel fills in for RX, for example. Symmetric queues do not work
> when you have a shared packet buffer between two processes. (This is
> not a requirement, but someone might do an mmap with MAP_SHARED for
> the packet buffer and then fork off a child that then inherits this
> packet buffer.)
> One of the processes might just receive packets, while the other one
> is transmitting. Or if you have a veth link pair between two
> processes and they have been set up to share a packet buffer area.
> With a symmetric queue you have to copy even if they share the same
> packet buffer, but with an asymmetric queue you do not, and the
> driver only needs to copy the packet buffer id between the TX desc
> ring of the sender to the RX desc ring of the receiver, not the
> data. I think this gives an indication that we need a new structure.
RE: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
> -----Original Message-----
> From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com]
> Sent: Friday, November 3, 2017 5:35 AM
> To: Björn Töpel <bjorn.to...@gmail.com>
> Cc: Karlsson, Magnus <magnus.karls...@intel.com>; Duyck, Alexander H
> <alexander.h.du...@intel.com>; Alexander Duyck
> <alexander.du...@gmail.com>; John Fastabend
> <john.fastab...@gmail.com>; Alexei Starovoitov <a...@fb.com>; Jesper
> Dangaard Brouer <bro...@redhat.com>; michael.lundkv...@ericsson.com;
> ravineet.si...@ericsson.com; Daniel Borkmann <dan...@iogearbox.net>;
> Network Development <netdev@vger.kernel.org>; Topel, Bjorn
> <bjorn.to...@intel.com>; Brandeburg, Jesse
> <jesse.brandeb...@intel.com>; Singhai, Anjali
> <anjali.sing...@intel.com>; Rosen, Rami <rami.ro...@intel.com>;
> Shaw, Jeffrey B <jeffrey.b.s...@intel.com>; Yigit, Ferruh
> <ferruh.yi...@intel.com>; Zhang, Qi Z <qi.z.zh...@intel.com>
> Subject: Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
>
> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.to...@gmail.com>
> wrote:
> > From: Björn Töpel <bjorn.to...@intel.com>
> >
> > This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
> > optimized for high performance packet processing and zero-copy
> > semantics. Throughput improvements can be up to 40x compared to V2
> > and V3 for the micro benchmarks included. Would be great to get
> > your feedback on it.
> >
> > The main difference between V4 and V2/V3 is that TX and RX
> > descriptors are separated from packet buffers.
>
> Cool feature. I'm looking forward to the netdev talk. Aside from the
> inline comments in the patches, a few architecture questions.

Glad to hear. Are you going to Netdev in Seoul? If so, let us hook up
and discuss your comments in further detail. Some initial thoughts
below.

> Is TX support needed? Existing PACKET_TX_RING already sends out
> packets without copying directly from the tx_ring.
> Indirection through a descriptor ring is not helpful on TX if all
> packets still have to come from a pre-registered packet pool. The
> patch set adds a lot of tx-only code and is complex enough without
> it.

That is correct, but what if the packet you are going to transmit came
in from the receive path and is already in the packet buffer? This
might happen if the application is examining/sniffing packets then
sending them out, or doing some modification to them. In that case we
avoid a copy in V4 since the packet is already in the packet buffer.
With V2 and V3, a copy from the RX ring to the TX ring would be
needed. In the PACKET_ZEROCOPY case, avoiding this copy increases
performance quite a lot.

> Can you use the existing PACKET_V2 format for the packet pool? The
> v4 format is nearly the same as V2. Using the same version might
> avoid some code duplication and simplify upgrading existing legacy
> code. Instead of continuing to add new versions whose behavior is
> implicit, perhaps we can add explicit mode PACKET_INDIRECT to
> PACKET_V2.

Interesting idea that I think is worth thinking more about. One
problem, though, with the V2 ring format model, and the current V4
format too by the way, when applied to user-space allocated memory, is
that they are symmetric, i.e. user space and kernel space have to
produce and consume the same amount of entries (within the length of
the descriptor area). User space sends down a buffer entry that the
kernel fills in for RX, for example. Symmetric queues do not work when
you have a shared packet buffer between two processes. (This is not a
requirement, but someone might do an mmap with MAP_SHARED for the
packet buffer and then fork off a child that then inherits this packet
buffer.) One of the processes might just receive packets, while the
other one is transmitting. Or if you have a veth link pair between two
processes and they have been set up to share a packet buffer area.
With a symmetric queue you have to copy even if they share the same
packet buffer, but with an asymmetric queue you do not, and the driver
only needs to copy the packet buffer id between the TX desc ring of
the sender to the RX desc ring of the receiver, not the data. I think
this gives an indication that we need a new structure. Anyway, I like
your idea and I think it is worth thinking more about it. Let us have
a discussion about this at Netdev, if you are there.

> Finally, is it necessary to define a new descriptor ring format? Same
> for the packet array and frame set. The kernel already has a few,
> such as virtio for the first, skb_array/ptr_ring, even the Linux
> list, for the second. These containers add a lot of new boilerplate
> code. If new formats are absolutely necessary, at least we should
> consider making them generic (like skb_array and ptr_ring).
Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel wrote:
> From: Björn Töpel
>
> This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
> optimized for high performance packet processing and zero-copy
> semantics. Throughput improvements can be up to 40x compared to V2
> and V3 for the micro benchmarks included. Would be great to get your
> feedback on it.
>
> The main difference between V4 and V2/V3 is that TX and RX
> descriptors are separated from packet buffers.

Cool feature. I'm looking forward to the netdev talk. Aside from the
inline comments in the patches, a few architecture questions.

Is TX support needed? Existing PACKET_TX_RING already sends out
packets without copying directly from the tx_ring. Indirection through
a descriptor ring is not helpful on TX if all packets still have to
come from a pre-registered packet pool. The patch set adds a lot of
tx-only code and is complex enough without it.

Can you use the existing PACKET_V2 format for the packet pool? The v4
format is nearly the same as V2. Using the same version might avoid
some code duplication and simplify upgrading existing legacy code.
Instead of continuing to add new versions whose behavior is implicit,
perhaps we can add an explicit mode PACKET_INDIRECT to PACKET_V2.

Finally, is it necessary to define a new descriptor ring format? Same
for the packet array and frame set. The kernel already has a few, such
as virtio for the first, skb_array/ptr_ring, even the Linux list, for
the second. These containers add a lot of new boilerplate code. If new
formats are absolutely necessary, at least we should consider making
them generic (like skb_array and ptr_ring). But I'd like to understand
first why, e.g., virtio cannot be used.
[RFC PATCH 00/14] Introducing AF_PACKET V4 support
From: Björn Töpel

This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
optimized for high performance packet processing and zero-copy
semantics. Throughput improvements can be up to 40x compared to V2 and
V3 for the micro benchmarks included. Would be great to get your
feedback on it.

The main difference between V4 and V2/V3 is that TX and RX descriptors
are separated from packet buffers. An RX or TX descriptor points to a
data buffer in a packet buffer area. RX and TX can share the same
packet buffer, so that a packet does not have to be copied between RX
and TX. Moreover, if a packet needs to be kept for a while due to a
possible retransmit, then the descriptor that points to that packet
buffer can be changed to point to another buffer and reused right
away. This again avoids copying data.

The RX and TX descriptor rings are registered with the setsockopts
PACKET_RX_RING and PACKET_TX_RING, as usual. The packet buffer area is
allocated by user space and registered with the kernel using the new
PACKET_MEMREG setsockopt. All these three areas are shared between
user space and kernel space.

When V4 executes like this, we say that it executes in "copy-mode".
Each packet is sent to the Linux stack and a copy of it is sent to
user space, so V4 behaves in the same way as V2 and V3. All syscalls
operating on file descriptors should just work as if it was V2 or V3.
However, when the new PACKET_ZEROCOPY setsockopt is called, V4 starts
to operate in true zero-copy mode. In this mode, the networking HW (or
SW driver if it is a virtual driver like veth) DMAs/puts packets
straight into the packet buffer that is shared between user space and
kernel space. The RX and TX descriptor queues of the networking HW are
NOT shared to user space. Only the kernel can read and write these,
and it is the kernel driver's responsibility to translate these
HW-specific descriptors to the HW-agnostic ones in the V4 virtual
descriptor rings that user space sees.
This way, a malicious user space program cannot mess with the
networking HW. The PACKET_ZEROCOPY setsockopt acts on a queue pair
(channel in ethtool speak), so one needs to steer the traffic to the
zero-copy enabled queue pair. Which queue to use is up to the user.

For an untrusted application, HW packet steering to a specific queue
pair (the one associated with the application) is a requirement, as
the application would otherwise be able to see other user space
processes' packets. If the HW cannot support the required packet
steering, packets need to be DMA'ed into non-user-space-visible kernel
buffers and from there copied out to user space. This RFC only
addresses NIC HW with packet steering capabilities.

PACKET_ZEROCOPY comes with "XDP batteries included", so XDP programs
will be executed for zero-copy enabled queues. We're also suggesting
adding a new XDP action, XDP_PASS_TO_KERNEL, to pass copies to the
kernel stack instead of the V4 user space queue in zero-copy mode.

There's a tpbench benchmarking/test application included. Say that
you'd like your UDP traffic from port 4242 to end up in queue 16, that
we'll enable zero-copy on. Here, we use ethtool for this:

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the benchmark in zero-copy mode can then be done using:

      tpbench -i p3p2 --rxdrop --zerocopy 17

Note that the --zerocopy command-line argument is one-based, not
zero-based.

We've run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores, which gives a total of 28, but only two cores are used in these
experiments: one for Tx/Rx and one for the user space application. The
memory is DDR4 @ 1067 MT/s, and the size of each DIMM is 8192 MB; with
8 of those DIMMs in the system we have 64 GB of total memory. The
compiler used is gcc version 5.4.0 20160609. The NIC is an Intel I40E
40 Gbit/s using the i40e driver.
Below are the results in Mpps of the I40E NIC benchmark runs for 64
byte packets, generated by commercial packet generator HW that is
generating packets at full 40 Gbit/s line rate.

Benchmark   V2     V3     V4     V4+ZC
rxdrop      0.67   0.73   0.74   33.7
txpush      0.98   0.98   0.91   19.6
l2fwd       0.66   0.71   0.67   15.5

The results are generated using the "bench_all.sh" script.

We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
Korea, and our paper with complete benchmarks will be released shortly
on the NetDev 2.2 site.

We based this patch set on net-next commit e1ea2f9856b7 ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").

Please focus your review on:

* The V4 user space interface
* PACKET_ZEROCOPY and its semantics
* Packet array interface
* XDP semantics when executing in zero-copy mode (user space passed
  buffers)
* XDP_PASS_TO_KERNEL semantics

To do:

* Investigate the user-space ring