Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, Feb 28, 2017 at 7:28 PM, David Millerwrote: > From: Andy Lutomirski > Date: Tue, 28 Feb 2017 13:06:49 -0800 > >> On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn >> wrote: >>> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski >>> wrote: On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk wrote: > [CC += linux-...@vger.kernel.org] > > Hi Willem > >> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and >> creates skbuff fragments directly from these pages. On tx completion, >> it notifies the socket owner that it is safe to modify memory by >> queuing a completion notification onto the socket error queue. What happens if the user writes to the pages while it's not safe? How about if you're writing to an interface or a route that has crypto involved and a malicious user can make the data change in the middle of a crypto operation, thus perhaps leaking the entire key? (I wouldn't be at all surprised if a lot of provably secure AEAD constructions are entirely compromised if an attacker can get the ciphertext and tag computed from a message that changed during the computation. >>> >>> Operations that read or write payload, such as this crypto example, >>> but also ebpf in tc or iptables, for instance, demand a deep copy using >>> skb_copy_ubufs before the operation. >>> >>> This blacklist approach requires caution, but these paths should be >>> few and countable. It is not possible to predict at the socket layer >>> whether a packet will encounter any such operation, so white-listing >>> a subset of end-to-end paths is not practical. >> >> How about hardware that malfunctions if the packet changes out from >> under it? A whitelist seems quite a bit safer. > > These device are already choking, because as I stated this can already > be done via sendfile(). > > Networking card wise this isn't an issue, chips bring the entire packet > into their FIFO, compute checksums on the fly mid-stream, and then write > the 16-bit checksum field before starting to write the packet onto the > wire. > > I think this is completely a non-issue, and we thought about this right > from the start when sendfile() support was added nearly two decades ago. > If network cards from back then didn't crap out in this situation I > think the ones out there now are probably ok. Fair enough.
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, Feb 28, 2017 at 7:28 PM, Tom Herbert wrote:
> On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet wrote:
>> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>>
>>> The user pages are a gift to the kernel. The application may not
>>> modify this memory ever, otherwise the page cache and on-disk data may
>>> differ.
>>>
>>> This is just not okay IMO.
>>
>> TCP works just fine in this case.
>>
>> TX checksum will be computed by the NIC after/while data is copied.
>>
>> If really the application changes the data, that will not cause any
>> problems, other than user side consistency.
>>
>> This is why we require a copy (for all buffers that came from zero-copy)
>> if network stack hits a device that can not offload TX checksum.
>>
>> Even pwrite() does not guarantee consistency if multiple threads are
>> using it on overlapping regions.
>>
> The Mellanox team working on TLS offload pointed out to us that if
> data is changed for a retransmit then it becomes trivial for someone
> snooping to break the encryption. Sounds pretty scary and it would be
> a shame if we couldn't use zero-copy in that use case :-( Hopefully we
> can find a solution...
>

This requires collusion by the process initiating the zerocopy send
to help the entity snooping the link. That could be an attack on admin
configured tunnels, but user-directed encryption offload like AF_TLS
can still use zerocopy.
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, 2017-02-28 at 22:28 -0500, David Miller wrote: > These device are already choking, because as I stated this can already > be done via sendfile(). > > Networking card wise this isn't an issue, chips bring the entire packet > into their FIFO, compute checksums on the fly mid-stream, and then write > the 16-bit checksum field before starting to write the packet onto the > wire. > > I think this is completely a non-issue, and we thought about this right > from the start when sendfile() support was added nearly two decades ago. > If network cards from back then didn't crap out in this situation I > think the ones out there now are probably ok. Well, we had to fix one issue with GSO fall back about 4 years ago. https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cef401de7be8c4e155c6746bfccf721a4fa5fab9 So extra scrutiny would be nice. Zero copy is incredibly hard to get right.
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
From: Andy LutomirskiDate: Tue, 28 Feb 2017 13:06:49 -0800 > On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn > wrote: >> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski wrote: >>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk >>> wrote: [CC += linux-...@vger.kernel.org] Hi Willem >>> > On a send call with MSG_ZEROCOPY, the kernel pins the user pages and > creates skbuff fragments directly from these pages. On tx completion, > it notifies the socket owner that it is safe to modify memory by > queuing a completion notification onto the socket error queue. >>> >>> What happens if the user writes to the pages while it's not safe? >>> >>> How about if you're writing to an interface or a route that has crypto >>> involved and a malicious user can make the data change in the middle >>> of a crypto operation, thus perhaps leaking the entire key? (I >>> wouldn't be at all surprised if a lot of provably secure AEAD >>> constructions are entirely compromised if an attacker can get the >>> ciphertext and tag computed from a message that changed during the >>> computation. >> >> Operations that read or write payload, such as this crypto example, >> but also ebpf in tc or iptables, for instance, demand a deep copy using >> skb_copy_ubufs before the operation. >> >> This blacklist approach requires caution, but these paths should be >> few and countable. It is not possible to predict at the socket layer >> whether a packet will encounter any such operation, so white-listing >> a subset of end-to-end paths is not practical. > > How about hardware that malfunctions if the packet changes out from > under it? A whitelist seems quite a bit safer. These device are already choking, because as I stated this can already be done via sendfile(). Networking card wise this isn't an issue, chips bring the entire packet into their FIFO, compute checksums on the fly mid-stream, and then write the 16-bit checksum field before starting to write the packet onto the wire. I think this is completely a non-issue, and we thought about this right from the start when sendfile() support was added nearly two decades ago. If network cards from back then didn't crap out in this situation I think the ones out there now are probably ok.
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
From: Andy Lutomirski
Date: Tue, 28 Feb 2017 11:46:23 -0800

> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk wrote:
>> [CC += linux-...@vger.kernel.org]
>>
>> Hi Willem
>>
>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>> creates skbuff fragments directly from these pages. On tx completion,
>>> it notifies the socket owner that it is safe to modify memory by
>>> queuing a completion notification onto the socket error queue.
>
> What happens if the user writes to the pages while it's not safe?

Just want to mention that this ability to write to data behind a
network send's back is not a new thing added by MSG_ZEROCOPY. All of
this is already possible with sendfile(). The pages can be written to
completely asynchronously to the data being pulled out of the page
cache into the transmit path.

The crypto case is interesting, but that is a separate discussion about
an existing problem rather than something specifically new to the
MSG_ZEROCOPY changes.

Thanks.
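To make the pre-existing behavior referenced above concrete, here is a minimal userspace sketch (illustrative only, not from this thread; the connected socket and open file are assumed to exist):

/*
 * sendfile() transmits pages straight out of the page cache, and nothing
 * stops another thread or process from writing the same file region while
 * the NIC may still be reading it -- the same hazard MSG_ZEROCOPY raises.
 * 'sock' is assumed to be a connected TCP socket, 'fd' an open file.
 */
#include <sys/sendfile.h>
#include <unistd.h>

static void racy_sendfile(int sock, int fd, size_t len)
{
	off_t off = 0;

	/* queue the file's pages for transmit without copying them */
	sendfile(sock, fd, &off, len);

	/* a concurrent pwrite() to the same range modifies the very pages
	 * that may still be in flight */
	pwrite(fd, "X", 1, 0);
}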
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, Feb 28, 2017 at 4:58 PM, Willem de Bruijnwrote: > On Tue, Feb 28, 2017 at 7:28 PM, Tom Herbert wrote: >> On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet wrote: >>> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote: >>> The user pages are a gift to the kernel. The application may not modify this memory ever, otherwise the page cache and on-disk data may differ. This is just not okay IMO. >>> >>> TCP works just fine in this case. >>> >>> TX checksum will be computed by the NIC after/while data is copied. >>> >>> If really the application changes the data, that will not cause any >>> problems, other than user side consistency. >>> >>> This is why we require a copy (for all buffers that came from zero-copy) >>> if network stack hits a device that can not offload TX checksum. >>> >>> Even pwrite() does not guarantee consistency if multiple threads are >>> using it on overlapping regions. >>> >> The Mellanox team working on TLS offload pointed out to us that if >> data is changed for a retransmit then it becomes trivial for someone >> snooping to break the encryption. Sounds pretty scary and it would be >> a shame if we couldn't use zero-copy in that use case :-( Hopefully we >> can find a solution... >> > > This requires collusion by the process initiating the zerocopy send > to help the entity snooping the link. That could be an attack on admin > configured tunnels, but user-directed encryption offload like AF_TLS > can still use zerocopy. Yes, but we can't trust the user to always understand or correctly implement the semantic nuances when security is involved. If we can't provide a robust API then the only recourse is to not allow zero copy in that case. We could suggest COW to solve all problems, but I think we know where the conversation will go ;-) Tom
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, 2017-02-28 at 14:25 -0800, Andy Lutomirski wrote:
> On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet wrote:
> > On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
> >
> >> Does this mean that a user program that does a zerocopy send can cause
> >> a retransmitted segment to contain different data than the original
> >> segment? If so, is that okay?
> >
> > Same remark applies to sendfile() already
>
> True.
>
> >, or other zero copy modes
> > (vmsplice() + splice() )
>
> I hate vmsplice(). I thought I remembered it being essentially
> disabled at some point due to security problems.

Right, zero copy is hard ;)

vmsplice() is not disabled in current kernels, unless I missed
something.
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet wrote:
> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>
>> The user pages are a gift to the kernel. The application may not
>> modify this memory ever, otherwise the page cache and on-disk data may
>> differ.
>>
>> This is just not okay IMO.
>
> TCP works just fine in this case.
>
> TX checksum will be computed by the NIC after/while data is copied.
>
> If really the application changes the data, that will not cause any
> problems, other than user side consistency.
>
> This is why we require a copy (for all buffers that came from zero-copy)
> if network stack hits a device that can not offload TX checksum.
>
> Even pwrite() does not guarantee consistency if multiple threads are
> using it on overlapping regions.
>

The Mellanox team working on TLS offload pointed out to us that if
data is changed for a retransmit then it becomes trivial for someone
snooping to break the encryption. Sounds pretty scary and it would be
a shame if we couldn't use zero-copy in that use case :-( Hopefully we
can find a solution...

Tom
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, 2017-02-28 at 16:28 -0800, Tom Herbert wrote: > The Mellanox team working on TLS offload pointed out to us that if > data is changed for a retransmit then it becomes trivial for someone > snooping to break the encryption. Sounds pretty scary and it would be > a shame if we couldn't use zero-copy in that use case :-( Hopefully we > can find a solution... Right, this is why offloading encryption over TCP is also hard ;)
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote: > The user pages are a gift to the kernel. The application may not > modify this memory ever, otherwise the page cache and on-disk data may > differ. > > This is just not okay IMO. TCP works just fine in this case. TX checksum will be computed by the NIC after/while data is copied. If really the application changes the data, that will not cause any problems, other than user side consistency. This is why we require a copy (for all buffers that came from zero-copy) if network stack hits a device that can not offload TX checksum. Even pwrite() does not guarantee consistency if multiple threads are using it on overlapping regions.
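As a concrete illustration of the fallback described above, a kernel-style sketch follows. It is not code from this patchset: the wrapper function and the exact hook point and feature bits are assumptions; only skb_copy_ubufs() (existing) and skb_zcopy() (added by this series) are real helpers.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Illustrative sketch of "copy the user frags if the device cannot
 * checksum in hardware", so the stack never checksums memory the
 * application may still be modifying. */
static int zerocopy_csum_fallback(struct sk_buff *skb, struct net_device *dev)
{
	if (!skb_zcopy(skb))
		return 0;

	/* device checksums on the fly: user pages can be sent as-is */
	if (dev->features & (NETIF_F_HW_CSUM | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM))
		return 0;

	/* otherwise deep-copy the user frags into private pages before
	 * software checksumming */
	return skb_copy_ubufs(skb, GFP_ATOMIC);
}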
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, Feb 28, 2017 at 2:40 PM, Eric Dumazet wrote:
> On Tue, 2017-02-28 at 14:25 -0800, Andy Lutomirski wrote:
>> On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet wrote:
>> > On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
>> >
>> >> Does this mean that a user program that does a zerocopy send can cause
>> >> a retransmitted segment to contain different data than the original
>> >> segment? If so, is that okay?
>> >
>> > Same remark applies to sendfile() already
>>
>> True.
>>
>> >, or other zero copy modes
>> > (vmsplice() + splice() )
>>
>> I hate vmsplice(). I thought I remembered it being essentially
>> disabled at some point due to security problems.
>
> Right, zero copy is hard ;)
>
> vmsplice() is not disabled in current kernels, unless I missed
> something.
>

I think you're right. That being said, from the man page:

  The user pages are a gift to the kernel. The application may not
  modify this memory ever, otherwise the page cache and on-disk data may
  differ.

This is just not okay IMO.
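For reference, a minimal sketch of the existing vmsplice()+splice() zero-copy path the man page excerpt refers to (illustrative; error handling trimmed, 'sock' assumed to be a connected socket, buffer assumed to fit in one pipe):

/* Once vmsplice() gifts the pages, modifying 'buf' races with the
 * transmit -- the semantic being objected to above. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

static int gift_send(int sock, void *buf, size_t len)
{
	int pfd[2];
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	if (pipe(pfd))
		return -1;
	/* hand the user pages to the kernel as a "gift" */
	if (vmsplice(pfd[1], &iov, 1, SPLICE_F_GIFT) < 0)
		return -1;
	/* move them from the pipe into the socket without copying */
	if (splice(pfd[0], NULL, sock, NULL, len, SPLICE_F_MOVE) < 0)
		return -1;
	close(pfd[0]);
	close(pfd[1]);
	return 0;
}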
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet wrote:
> On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
>
>> Does this mean that a user program that does a zerocopy send can cause
>> a retransmitted segment to contain different data than the original
>> segment? If so, is that okay?
>
> Same remark applies to sendfile() already

True.

>, or other zero copy modes
> (vmsplice() + splice() )

I hate vmsplice(). I thought I remembered it being essentially
disabled at some point due to security problems.
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote: > Does this mean that a user program that does a zerocopy send can cause > a retransmitted segment to contain different data than the original > segment? If so, is that okay? Same remark applies to sendfile() already, or other zero copy modes (vmsplice() + splice() )
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn wrote:
>
>> I can see this working if you have a special type of skb that
>> indicates that the data might be concurrently written and have all the
>> normal skb APIs (including, especially, anything that clones it) make
>> a copy first.
>
> Support for cloned skbs is required for TCP, both at tcp_transmit_skb
> and segmentation offload. Patch 4 especially adds reference counting
> of shared pages across clones and other sk_buff operations like
> pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
> on clones in specific datapaths like the above.

Does this mean that a user program that does a zerocopy send can cause
a retransmitted segment to contain different data than the original
segment? If so, is that okay?
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
>>> I can see this working if you have a special type of skb that >>> indicates that the data might be concurrently written and have all the >>> normal skb APIs (including, especially, anything that clones it) make >>> a copy first. >> >> Support for cloned skbs is required for TCP, both at tcp_transmit_skb >> and segmentation offload. Patch 4 especially adds reference counting >> of shared pages across clones and other sk_buff operations like >> pskb_expand_head. This still allows for deep copy (skb_copy_ubufs) >> on clones in specific datapaths like the above. > > Does this mean that a user program that does a zerocopy send can cause > a retransmitted segment to contain different data than the original > segment? If so, is that okay? That is possible, indeed. The bytestream at the receiver is then likely undefined. Though integrity of the tcp receive stack should not be affected. A valid question is whether the stack should protect against such users. The pattern is reminiscent of evasion attacks. In the limited case, privileged users already can generate this data pattern, of course.
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijnwrote: > On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski wrote: >> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk >> wrote: >>> [CC += linux-...@vger.kernel.org] >>> >>> Hi Willem >>> >> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and creates skbuff fragments directly from these pages. On tx completion, it notifies the socket owner that it is safe to modify memory by queuing a completion notification onto the socket error queue. >> >> What happens if the user writes to the pages while it's not safe? >> >> How about if you're writing to an interface or a route that has crypto >> involved and a malicious user can make the data change in the middle >> of a crypto operation, thus perhaps leaking the entire key? (I >> wouldn't be at all surprised if a lot of provably secure AEAD >> constructions are entirely compromised if an attacker can get the >> ciphertext and tag computed from a message that changed during the >> computation. > > Operations that read or write payload, such as this crypto example, > but also ebpf in tc or iptables, for instance, demand a deep copy using > skb_copy_ubufs before the operation. > > This blacklist approach requires caution, but these paths should be > few and countable. It is not possible to predict at the socket layer > whether a packet will encounter any such operation, so white-listing > a subset of end-to-end paths is not practical. How about hardware that malfunctions if the packet changes out from under it? A whitelist seems quite a bit safer. --Andy
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski wrote:
> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk wrote:
>> [CC += linux-...@vger.kernel.org]
>>
>> Hi Willem
>>
>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>> creates skbuff fragments directly from these pages. On tx completion,
>>> it notifies the socket owner that it is safe to modify memory by
>>> queuing a completion notification onto the socket error queue.
>
> What happens if the user writes to the pages while it's not safe?
>
> How about if you're writing to an interface or a route that has crypto
> involved and a malicious user can make the data change in the middle
> of a crypto operation, thus perhaps leaking the entire key? (I
> wouldn't be at all surprised if a lot of provably secure AEAD
> constructions are entirely compromised if an attacker can get the
> ciphertext and tag computed from a message that changed during the
> computation.

Operations that read or write payload, such as this crypto example,
but also ebpf in tc or iptables, for instance, demand a deep copy using
skb_copy_ubufs before the operation.

This blacklist approach requires caution, but these paths should be
few and countable. It is not possible to predict at the socket layer
whether a packet will encounter any such operation, so white-listing
a subset of end-to-end paths is not practical.

> I can see this working if you have a special type of skb that
> indicates that the data might be concurrently written and have all the
> normal skb APIs (including, especially, anything that clones it) make
> a copy first.

Support for cloned skbs is required for TCP, both at tcp_transmit_skb
and segmentation offload. Patch 4 especially adds reference counting
of shared pages across clones and other sk_buff operations like
pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
on clones in specific datapaths like the above.
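A sketch of the "deep copy before payload-touching operations" pattern described above (illustrative only: the wrapper is hypothetical, skb_copy_ubufs() is the existing helper, skb_zcopy() is the detection helper this series adds):

#include <linux/skbuff.h>

static int payload_op_prepare(struct sk_buff *skb)
{
	/* fast path: no user-owned frags, nothing to do */
	if (likely(!skb_zcopy(skb)))
		return 0;

	/* replace the pinned user pages with private copies so that a
	 * BPF program, netfilter target or crypto transform never reads
	 * or writes memory the application can still modify */
	return skb_copy_ubufs(skb, GFP_ATOMIC);
}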
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk wrote:
> [CC += linux-...@vger.kernel.org]
>
> Hi Willem
>
>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>> creates skbuff fragments directly from these pages. On tx completion,
>> it notifies the socket owner that it is safe to modify memory by
>> queuing a completion notification onto the socket error queue.

What happens if the user writes to the pages while it's not safe?

How about if you're writing to an interface or a route that has crypto
involved and a malicious user can make the data change in the middle
of a crypto operation, thus perhaps leaking the entire key? (I
wouldn't be at all surprised if a lot of provably secure AEAD
constructions are entirely compromised if an attacker can get the
ciphertext and tag computed from a message that changed during the
computation.

I can see this working if you have a special type of skb that
indicates that the data might be concurrently written and have all the
normal skb APIs (including, especially, anything that clones it) make
a copy first.

--Andy
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
[CC += linux-...@vger.kernel.org] Hi Willem This is a change to the kernel-user-space API. Please CC linux-...@vger.kernel.org on any future iterations of this patch. Thanks, Michael On Wed, Feb 22, 2017 at 5:38 PM, Willem de Bruijnwrote: > From: Willem de Bruijn > > RFCv2: > > I have received a few requests for status and rebased code of this > feature. We have been running this code internally, discovering and > fixing various bugs. With net-next closed, now seems like a good time > to share an updated patchset with fixes. The rebase from RFCv1/v4.2 > was mostly straightforward: mainly iov_iter changes. Full changelog: > > RFC -> RFCv2: > - review comment: do not loop skb with zerocopy frags onto rx: > add skb_orphan_frags_rx to orphan even refcounted frags > call this in __netif_receive_skb_core, deliver_skb and tun: > the same as 1080e512d44d ("net: orphan frags on receive") > - fix: hold an explicit sk reference on each notification skb. > previously relied on the reference (or wmem) held by the > data skb that would trigger notification, but this breaks > on skb_orphan. > - fix: when aborting a send, do not inc the zerocopy counter > this caused gaps in the notification chain > - fix: in packet with SOCK_DGRAM, pull ll headers before calling > zerocopy_sg_from_iter > - fix: if sock_zerocopy_realloc does not allow coalescing, > do not fail, just allocate a new ubuf > - fix: in tcp, check return value of second allocation attempt > - chg: allocate notification skbs from optmem > to avoid affecting tcp write queue accounting (TSQ) > - chg: limit #locked pages (ulimit) per user instead of per process > - chg: grow notification ids from 16 to 32 bit > - pass range [lo, hi] through 32 bit fields ee_info and ee_data > - chg: rebased to davem-net-next on top of v4.10-rc7 > - add: limit notification coalescing > sharing ubufs limits overhead, but delays notification until > the last packet is released, possibly unbounded. Add a cap. > - tests: add snd_zerocopy_lo pf_packet test > - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug) > > The change to allocate notification skbuffs from optmem requires > ensuring that net.core.optmem is at least a few 100KB. To > experiment, run > > sysctl -w net.core.optmem_max=1048576 > > The snd_zerocopy_lo benchmarks reported in the individual patches were > rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were > replaced with skb_orphan_frags to allow looping to local sockets. The > netperf results below are also rerun with v2. > > In application load, copy avoidance shows a roughly 5% systemwide > reduction in cycles when streaming large flows and a 4-8% reduction in > wall clock time on early tensorflow test workloads. > > > Overview (from original RFC): > > Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY. > Implement the feature for TCP, UDP, RAW and packet sockets. This is > a generalization of a previous packet socket RFC patch > > http://patchwork.ozlabs.org/patch/413184/ > > On a send call with MSG_ZEROCOPY, the kernel pins the user pages and > creates skbuff fragments directly from these pages. On tx completion, > it notifies the socket owner that it is safe to modify memory by > queuing a completion notification onto the socket error queue. > > The kernel already implements such copy avoidance with vmsplice plus > splice and with ubuf_info for tun and virtio. 
Extend the second > with features required by TCP and others: reference counting to > support cloning (retransmit queue) and shared fragments (GSO) and > notification coalescing to handle corking. > > Notifications are queued onto the socket error queue as a range > range [N, N+m], where N is a per-socket counter incremented on each > successful zerocopy send call. > > * Performance > > The below table shows cycles reported by perf for a netperf process > sending a single 10 Gbps TCP_STREAM. The first three columns show > Mcycles spent in the netperf process context. The second three columns > show time spent systemwide (-a -C A,B) on the two cpus that run the > process and interrupt handler. Reported is the median of at least 3 > runs. std is a standard netperf, zc uses zerocopy and % is the ratio. > Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs > are disabled and the kernel is booted with idle=halt. > > NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size > > perf stat -e cycles $NETPERF > perf stat -C 2,3 -a -e cycles $NETPERF > > --process cycles-- cpu cycles >std zc % std zc % > 4K 27,609 11,217 41 49,217 39,175 79 > 16K 21,370 3,823 18 43,540 29,213 67 > 64K 20,557 2,312 11 42,189 26,910 64
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Fri, Feb 24, 2017 at 6:03 PM, Alexei Starovoitov wrote:
> On Wed, Feb 22, 2017 at 11:38:49AM -0500, Willem de Bruijn wrote:
>>
>> * Limitations / Known Issues
>>
>> - PF_INET6 is not yet supported.
>
> we struggled so far to make it work in our setups which are ipv6 only.
> Looking at patches it seems the code should just work.
> What particularly is missing ?
>
> Great stuff. Looking forward to net-next reopening :)

Thanks for taking the feature for a spin! The udp and raw paths require
separate ipv6 patches. TCP should indeed just work. I just ran a
slightly modified snd_zerocopy_lo with good results as well as a hacked
netperf to another host. I should have had ipv6 from the start, of
course. Will add it before resubmitting when net-next opens.

For now, quick hack to snd_zerocopy_lo.c:

diff --git a/tools/testing/selftests/net/snd_zerocopy_lo.c b/tools/testing/selftests/net/snd_zerocopy_lo.c
index 309b016a4fd5..38a165e2af64 100644
--- a/tools/testing/selftests/net/snd_zerocopy_lo.c
+++ b/tools/testing/selftests/net/snd_zerocopy_lo.c
@@ -453,7 +453,7 @@ static int do_setup_rx(int domain, int type, int protocol)

 static void do_setup_and_run(int domain, int type, int protocol)
 {
-	struct sockaddr_in addr;
+	struct sockaddr_in6 addr;
 	socklen_t alen;
 	int fdr, fdt, ret;

@@ -468,8 +468,8 @@ static void do_setup_and_run(int domain, int type, int protocol)

 	if (domain != PF_PACKET) {
 		memset(&addr, 0, sizeof(addr));
-		addr.sin_family = AF_INET;
-		addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+		addr.sin6_family = AF_INET6;
+		addr.sin6_addr = in6addr_loopback;
 		alen = sizeof(addr);

 		if (bind(fdr, (void *) &addr, sizeof(addr)))
@@ -589,7 +589,7 @@ int main(int argc, char **argv)
 		if (cfg_test_raw_hdrincl)
 			do_setup_and_run(PF_INET, SOCK_RAW, IPPROTO_RAW);
 		if (cfg_test_tcp)
-			do_setup_and_run(PF_INET, SOCK_STREAM, 0);
+			do_setup_and_run(PF_INET6, SOCK_STREAM, 0);

Loopback zerocopy is disabled in RFCv2, so to use snd_zerocopy_lo to
verify the feature requires this hack in skb_orphan_frags_rx:

 static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
 {
-	if (likely(!skb_zcopy(skb)))
-		return 0;
-	return skb_copy_ubufs(skb, gfp_mask);
+	return skb_orphan_frags(skb, gfp_mask);
 }

With this change, I see

$ ./snd_zerocopy_lo_ipv6 -t
test socket(10, 1, 0)
cpu: 23
rx=106364 (6637 MB) tx=106364 txc=0
rx=209736 (13088 MB) tx=209736 txc=0
rx=314524 (19627 MB) tx=314524 txc=0
rx=419424 (26174 MB) tx=419424 txc=0
OK. All tests passed

$ ./snd_zerocopy_lo_ipv6 -t -z
test socket(10, 1, 0)
cpu: 23
rx=239792 (14964 MB) tx=239792 txc=239786
rx=477376 (29790 MB) tx=477376 txc=477370
rx=715016 (44620 MB) tx=715016 txc=715010
rx=952820 (59460 MB) tx=952820 txc=952814
OK. All tests passed

In comparison, the same without the change

$ ./snd_zerocopy_lo_ipv6 -t
test socket(10, 1, 0)
cpu: 23
rx=109908 (6858 MB) tx=109908 txc=0
rx=217100 (13548 MB) tx=217100 txc=0
rx=326584 (20380 MB) tx=326584 txc=0
rx=429568 (26807 MB) tx=429568 txc=0
OK. All tests passed

$ ./snd_zerocopy_lo_ipv6 -t -z
test socket(10, 1, 0)
cpu: 23
rx=87636 (5468 MB) tx=87636 txc=87630
rx=174328 (10878 MB) tx=174328 txc=174322
rx=260360 (16247 MB) tx=260360 txc=260354
rx=346512 (21623 MB) tx=346512 txc=346506

Here the sk_buff hits the deep copy in skb_copy_ubufs called from
__netif_receive_skb_core, which actually degrades performance versus
copying as part of the sendmsg() syscall.
The netperf change is to add MSG_ZEROCOPY to send() in send_tcp_stream and also adding a recvmsg(send_socket, , MSG_ERRQUEUE) to the same function, preferably called only once every N iterations. This does not take any additional explicit references on the send_ring element while data is in flight, so is really a hack, but ring contents should be static throughout the test. I did not modify the omni tests, so this requires building with --no-omni.
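A hedged sketch of the periodic error-queue drain described above, in the spirit of the netperf hack rather than the actual patch. Per the cover letter, the completion carries a [lo, hi] range of send-call counters in ee_info and ee_data; the SO_EE_ORIGIN_ZEROCOPY constant shown is the name used by the later upstream API and is an assumption for this RFC.

#include <linux/errqueue.h>
#include <stdio.h>
#include <sys/socket.h>

static void drain_completions(int fd)
{
	char control[100];
	struct msghdr msg = { .msg_control = control,
			      .msg_controllen = sizeof(control) };
	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	/* MSG_ERRQUEUE reads never block; loop until the queue is empty */
	while (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) != -1) {
		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
			serr = (struct sock_extended_err *)CMSG_DATA(cm);
			if (serr->ee_errno != 0 ||
			    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
				continue;
			/* buffers for sends lo..hi may be reused now */
			printf("completed %u..%u\n", serr->ee_info, serr->ee_data);
		}
		msg.msg_controllen = sizeof(control);
	}
}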
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
On Wed, Feb 22, 2017 at 11:38:49AM -0500, Willem de Bruijn wrote: > > * Limitations / Known Issues > > - PF_INET6 is not yet supported. we struggled so far to make it work in our setups which are ipv6 only. Looking at patches it seems the code should just work. What particularly is missing ? Great stuff. Looking forward to net-next reopening :)
Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
From: Willem de Bruijn
Date: Wed, 22 Feb 2017 11:38:49 -0500

> From: Willem de Bruijn
>
> RFCv2:
>
> I have received a few requests for status and rebased code of this
> feature. We have been running this code internally, discovering and
> fixing various bugs. With net-next closed, now seems like a good time
> to share an updated patchset with fixes. The rebase from RFCv1/v4.2
> was mostly straightforward: mainly iov_iter changes. Full changelog:

I've been over this series once or twice and generally speaking it
looks fine to me.
[PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
From: Willem de Bruijn

RFCv2:

I have received a few requests for status and rebased code of this
feature. We have been running this code internally, discovering and
fixing various bugs. With net-next closed, now seems like a good time
to share an updated patchset with fixes. The rebase from RFCv1/v4.2
was mostly straightforward: mainly iov_iter changes. Full changelog:

RFC -> RFCv2:
  - review comment: do not loop skb with zerocopy frags onto rx:
      add skb_orphan_frags_rx to orphan even refcounted frags
      call this in __netif_receive_skb_core, deliver_skb and tun:
      the same as 1080e512d44d ("net: orphan frags on receive")
  - fix: hold an explicit sk reference on each notification skb.
      previously relied on the reference (or wmem) held by the
      data skb that would trigger notification, but this breaks
      on skb_orphan.
  - fix: when aborting a send, do not inc the zerocopy counter
      this caused gaps in the notification chain
  - fix: in packet with SOCK_DGRAM, pull ll headers before calling
      zerocopy_sg_from_iter
  - fix: if sock_zerocopy_realloc does not allow coalescing,
      do not fail, just allocate a new ubuf
  - fix: in tcp, check return value of second allocation attempt
  - chg: allocate notification skbs from optmem
      to avoid affecting tcp write queue accounting (TSQ)
  - chg: limit #locked pages (ulimit) per user instead of per process
  - chg: grow notification ids from 16 to 32 bit
      - pass range [lo, hi] through 32 bit fields ee_info and ee_data
  - chg: rebased to davem-net-next on top of v4.10-rc7
  - add: limit notification coalescing
      sharing ubufs limits overhead, but delays notification until
      the last packet is released, possibly unbounded. Add a cap.
  - tests: add snd_zerocopy_lo pf_packet test
  - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)

The change to allocate notification skbuffs from optmem requires
ensuring that net.core.optmem is at least a few 100KB. To
experiment, run

  sysctl -w net.core.optmem_max=1048576

The snd_zerocopy_lo benchmarks reported in the individual patches were
rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were
replaced with skb_orphan_frags to allow looping to local sockets. The
netperf results below are also rerun with v2.

In application load, copy avoidance shows a roughly 5% systemwide
reduction in cycles when streaming large flows and a 4-8% reduction in
wall clock time on early tensorflow test workloads.


Overview (from original RFC):

Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
Implement the feature for TCP, UDP, RAW and packet sockets. This is
a generalization of a previous packet socket RFC patch

  http://patchwork.ozlabs.org/patch/413184/

On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
creates skbuff fragments directly from these pages. On tx completion,
it notifies the socket owner that it is safe to modify memory by
queuing a completion notification onto the socket error queue.

The kernel already implements such copy avoidance with vmsplice plus
splice and with ubuf_info for tun and virtio. Extend the second
with features required by TCP and others: reference counting to
support cloning (retransmit queue) and shared fragments (GSO) and
notification coalescing to handle corking.

Notifications are queued onto the socket error queue as a range
[N, N+m], where N is a per-socket counter incremented on each
successful zerocopy send call.

* Performance

The below table shows cycles reported by perf for a netperf process
sending a single 10 Gbps TCP_STREAM. The first three columns show
Mcycles spent in the netperf process context. The second three columns
show time spent systemwide (-a -C A,B) on the two cpus that run the
process and interrupt handler. Reported is the median of at least 3
runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs
are disabled and the kernel is booted with idle=halt.

  NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size

  perf stat -e cycles $NETPERF
  perf stat -C 2,3 -a -e cycles $NETPERF

         --process cycles--      ----cpu cycles----
            std      zc   %        std      zc   %
  4K     27,609  11,217   41     49,217  39,175  79
  16K    21,370   3,823   18     43,540  29,213  67
  64K    20,557   2,312   11     42,189  26,910  64
  256K   21,110   2,134   10     43,006  27,104  63
  1M     20,987   1,610    8     42,759  25,931  61

Perf record indicates the main source of these differences. Process
cycles only at 1M writes (perf record; perf report -n):

std:
Samples: 42K of event 'cycles', Event count (approx.): 21258597313
 79.41%         33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
  3.27%          1396
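For readers new to the interface, a hedged usage sketch of a single zerocopy send and the matching completion wait, based only on the semantics in this cover letter (the MSG_ZEROCOPY flag and the [lo, hi] range in ee_info/ee_data). The SO_EE_ORIGIN_ZEROCOPY constant is the later upstream name and is assumed here; whether an opt-in setsockopt is required is not covered, and error handling is trimmed.

#include <linux/errqueue.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static void send_zerocopy_once(int fd, void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
	char control[100];
	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	/* the kernel pins buf's pages and builds skb frags from them */
	sendmsg(fd, &msg, MSG_ZEROCOPY);

	/* buf must not be modified until the notification arrives */
	memset(&msg, 0, sizeof(msg));
	msg.msg_control = control;
	msg.msg_controllen = sizeof(control);
	while (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
		;	/* a real application would poll() for POLLERR */

	cm = CMSG_FIRSTHDR(&msg);
	serr = (struct sock_extended_err *)CMSG_DATA(cm);
	if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
		/* sends numbered ee_info..ee_data completed; buf reusable */
	}
}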