Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-03-02 Thread Andy Lutomirski
On Tue, Feb 28, 2017 at 7:28 PM, David Miller  wrote:
> From: Andy Lutomirski 
> Date: Tue, 28 Feb 2017 13:06:49 -0800
>
>> On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
>>  wrote:
>>> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski  
>>> wrote:
 On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
  wrote:
> [CC += linux-...@vger.kernel.org]
>
> Hi Willem
>

>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>> creates skbuff fragments directly from these pages. On tx completion,
>> it notifies the socket owner that it is safe to modify memory by
>> queuing a completion notification onto the socket error queue.

 What happens if the user writes to the pages while it's not safe?

 How about if you're writing to an interface or a route that has crypto
 involved and a malicious user can make the data change in the middle
 of a crypto operation, thus perhaps leaking the entire key?  (I
 wouldn't be at all surprised if a lot of provably secure AEAD
 constructions are entirely compromised if an attacker can get the
 ciphertext and tag computed from a message that changed during the
 computation.)
>>>
>>> Operations that read or write payload, such as this crypto example,
>>> but also ebpf in tc or iptables, for instance, demand a deep copy using
>>> skb_copy_ubufs before the operation.
>>>
>>> This blacklist approach requires caution, but these paths should be
>>> few and countable. It is not possible to predict at the socket layer
>>> whether a packet will encounter any such operation, so white-listing
>>> a subset of end-to-end paths is not practical.
>>
>> How about hardware that malfunctions if the packet changes out from
>> under it?  A whitelist seems quite a bit safer.
>
> These devices are already choking, because as I stated this can already
> be done via sendfile().
>
> Networking card wise this isn't an issue, chips bring the entire packet
> into their FIFO, compute checksums on the fly mid-stream, and then write
> the 16-bit checksum field before starting to write the packet onto the
> wire.
>
> I think this is completely a non-issue, and we thought about this right
> from the start when sendfile() support was added nearly two decades ago.
> If network cards from back then didn't crap out in this situation I
> think the ones out there now are probably ok.

Fair enough.


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Willem de Bruijn
On Tue, Feb 28, 2017 at 7:28 PM, Tom Herbert  wrote:
> On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet  wrote:
>> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>>
>>> The user pages are a gift to the kernel.  The application  may  not
>>> modify this memory ever, otherwise the page cache and on-disk data may
>>> differ.
>>>
>>> This is just not okay IMO.
>>
>> TCP works just fine in this case.
>>
>> TX checksum will be computed by the NIC after/while data is copied.
>>
>> If really the application changes the data, that will not cause any
>> problems, other than user side consistency.
>>
>> This is why we require a copy (for all buffers that came from zero-copy)
>> if network stack hits a device that can not offload TX checksum.
>>
>> Even pwrite() does not guarantee consistency if multiple threads are
>> using it on overlapping regions.
>>
> The Mellanox team working on TLS offload pointed out to us that if
> data is changed for a retransmit then it becomes trivial for someone
> snooping to break the encryption. Sounds pretty scary and it would be
> a shame if we couldn't use zero-copy in that use case :-( Hopefully we
> can find a solution...
>

This requires collusion by the process initiating the zerocopy send
to help the entity snooping the link. That could be an attack on admin
configured tunnels, but user-directed encryption offload like AF_TLS
can still use zerocopy.


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Eric Dumazet
On Tue, 2017-02-28 at 22:28 -0500, David Miller wrote:

> These devices are already choking, because as I stated this can already
> be done via sendfile().
> 
> Networking card wise this isn't an issue, chips bring the entire packet
> into their FIFO, compute checksums on the fly mid-stream, and then write
> the 16-bit checksum field before starting to write the packet onto the
> wire.
> 
> I think this is completely a non-issue, and we thought about this right
> from the start when sendfile() support was added nearly two decades ago.
> If network cards from back then didn't crap out in this situation I
> think the ones out there now are probably ok.

Well, we had to fix one issue with GSO fall back about 4 years ago.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cef401de7be8c4e155c6746bfccf721a4fa5fab9

So extra scrutiny would be nice.

Zero copy is incredibly hard to get right.





Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread David Miller
From: Andy Lutomirski 
Date: Tue, 28 Feb 2017 13:06:49 -0800

> On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
>  wrote:
>> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski  wrote:
>>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>>>  wrote:
 [CC += linux-...@vger.kernel.org]

 Hi Willem

>>>
> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
> creates skbuff fragments directly from these pages. On tx completion,
> it notifies the socket owner that it is safe to modify memory by
> queuing a completion notification onto the socket error queue.
>>>
>>> What happens if the user writes to the pages while it's not safe?
>>>
>>> How about if you're writing to an interface or a route that has crypto
>>> involved and a malicious user can make the data change in the middle
>>> of a crypto operation, thus perhaps leaking the entire key?  (I
>>> wouldn't be at all surprised if a lot of provably secure AEAD
>>> constructions are entirely compromised if an attacker can get the
>>> ciphertext and tag computed from a message that changed during the
 computation.)
>>
>> Operations that read or write payload, such as this crypto example,
>> but also ebpf in tc or iptables, for instance, demand a deep copy using
>> skb_copy_ubufs before the operation.
>>
>> This blacklist approach requires caution, but these paths should be
>> few and countable. It is not possible to predict at the socket layer
>> whether a packet will encounter any such operation, so white-listing
>> a subset of end-to-end paths is not practical.
> 
> How about hardware that malfunctions if the packet changes out from
> under it?  A whitelist seems quite a bit safer.

These devices are already choking, because as I stated this can already
be done via sendfile().

Networking card wise this isn't an issue, chips bring the entire packet
into their FIFO, compute checksums on the fly mid-stream, and then write
the 16-bit checksum field before starting to write the packet onto the
wire.

I think this is completely a non-issue, and we thought about this right
from the start when sendfile() support was added nearly two decades ago.
If network cards from back then didn't crap out in this situation I
think the ones out there now are probably ok.


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread David Miller
From: Andy Lutomirski 
Date: Tue, 28 Feb 2017 11:46:23 -0800

> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>  wrote:
>> [CC += linux-...@vger.kernel.org]
>>
>> Hi Willem
>>
> 
>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>> creates skbuff fragments directly from these pages. On tx completion,
>>> it notifies the socket owner that it is safe to modify memory by
>>> queuing a completion notification onto the socket error queue.
> 
> What happens if the user writes to the pages while it's not safe?

Just want to mention that this ability to write to data behind a
network send's back is not a new thing added by MSG_ZEROCOPY.

All of this is already possible with sendfile().  The pages can be
written to completely asynchronously to the data being pulled out of
the page cache into the transmit path.

The crypto case is interesting, but that is a separate discussion
about an existing problem rather than something specifically new to
the MSG_ZEROCOPY changes.

Thanks.


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Tom Herbert
On Tue, Feb 28, 2017 at 4:58 PM, Willem de Bruijn
 wrote:
> On Tue, Feb 28, 2017 at 7:28 PM, Tom Herbert  wrote:
>> On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet  wrote:
>>> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>>>
 The user pages are a gift to the kernel.  The application  may  not
 modify this memory ever, otherwise the page cache and on-disk data may
 differ.

 This is just not okay IMO.
>>>
>>> TCP works just fine in this case.
>>>
>>> TX checksum will be computed by the NIC after/while data is copied.
>>>
>>> If really the application changes the data, that will not cause any
>>> problems, other than user side consistency.
>>>
>>> This is why we require a copy (for all buffers that came from zero-copy)
>>> if network stack hits a device that can not offload TX checksum.
>>>
>>> Even pwrite() does not guarantee consistency if multiple threads are
>>> using it on overlapping regions.
>>>
>> The Mellanox team working on TLS offload pointed out to us that if
>> data is changed for a retransmit then it becomes trivial for someone
>> snooping to break the encryption. Sounds pretty scary and it would be
>> a shame if we couldn't use zero-copy in that use case :-( Hopefully we
>> can find a solution...
>>
>
> This requires collusion by the process initiating the zerocopy send
> to help the entity snooping the link. That could be an attack on admin
> configured tunnels, but user-directed encryption offload like AF_TLS
> can still use zerocopy.

Yes, but we can't trust the user to always understand or correctly
implement the semantic nuances when security is involved. If we can't
provide a robust API then the only recourse is to not allow zero copy
in that case. We could suggest COW to solve all problems, but I think
we know where the conversation will go ;-)

Tom


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Eric Dumazet
On Tue, 2017-02-28 at 14:25 -0800, Andy Lutomirski wrote:
> On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet  wrote:
> > On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
> >
> >> Does this mean that a user program that does a zerocopy send can cause
> >> a retransmitted segment to contain different data than the original
> >> segment?  If so, is that okay?
> >
> > Same remark applies to sendfile() already
> 
> True.
> 
> >, or other zero copy modes
> > (vmsplice() + splice() )
> 
> I hate vmsplice().  I thought I remembered it being essentially
> disabled at some point due to security problems.

Right, zero copy is hard ;)

vmsplice() is not disabled in current kernels, unless I missed
something.






Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Tom Herbert
On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet  wrote:
> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>
>> The user pages are a gift to the kernel.  The application  may  not
>> modify this memory ever, otherwise the page cache and on-disk data may
>> differ.
>>
>> This is just not okay IMO.
>
> TCP works just fine in this case.
>
> TX checksum will be computed by the NIC after/while data is copied.
>
> If really the application changes the data, that will not cause any
> problems, other than user side consistency.
>
> This is why we require a copy (for all buffers that came from zero-copy)
> if network stack hits a device that can not offload TX checksum.
>
> Even pwrite() does not guarantee consistency if multiple threads are
> using it on overlapping regions.
>
The Mellanox team working on TLS offload pointed out to us that if
data is changed for a retransmit then it becomes trivial for someone
snooping to break the encryption. Sounds pretty scary and it would be
a shame if we couldn't use zero-copy in that use case :-( Hopefully we
can find a solution...

Tom

>
>


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Eric Dumazet
On Tue, 2017-02-28 at 16:28 -0800, Tom Herbert wrote:

> The Mellanox team working on TLS offload pointed out to us that if
> data is changed for a retransmit then it becomes trivial for someone
> snooping to break the encryption. Sounds pretty scary and it would be
> a shame if we couldn't use zero-copy in that use case :-( Hopefully we
> can find a solution...

Right, this is why offloading encryption over TCP is also hard ;)





Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Eric Dumazet
On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:

> The user pages are a gift to the kernel.  The application  may  not
> modify this memory ever, otherwise the page cache and on-disk data may
> differ.
> 
> This is just not okay IMO.

TCP works just fine in this case.

TX checksum will be computed by the NIC after/while data is copied.

If really the application changes the data, that will not cause any
problems, other than user side consistency.

This is why we require a copy (for all buffers that came from zero-copy)
if network stack hits a device that can not offload TX checksum.

Even pwrite() does not guarantee consistency if multiple threads are
using it on overlapping regions.





Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Andy Lutomirski
On Tue, Feb 28, 2017 at 2:40 PM, Eric Dumazet  wrote:
> On Tue, 2017-02-28 at 14:25 -0800, Andy Lutomirski wrote:
>> On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet  wrote:
>> > On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
>> >
>> >> Does this mean that a user program that does a zerocopy send can cause
>> >> a retransmitted segment to contain different data than the original
>> >> segment?  If so, is that okay?
>> >
>> > Same remark applies to sendfile() already
>>
>> True.
>>
>> >, or other zero copy modes
>> > (vmsplice() + splice() )
>>
>> I hate vmsplice().  I thought I remembered it being essentially
>> disabled at some point due to security problems.
>
> Right, zero copy is hard ;)
>
> vmsplice() is not disabled in current kernels, unless I missed
> something.
>

I think you're right.  That being said, from the man page:

The user pages are a gift to the kernel.  The application  may  not
modify this memory ever, otherwise the page cache and on-disk data may
differ.

This is just not okay IMO.


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Andy Lutomirski
On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet  wrote:
> On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
>
>> Does this mean that a user program that does a zerocopy send can cause
>> a retransmitted segment to contain different data than the original
>> segment?  If so, is that okay?
>
> Same remark applies to sendfile() already

True.

>, or other zero copy modes
> (vmsplice() + splice() )

I hate vmsplice().  I thought I remembered it being essentially
disabled at some point due to security problems.

>


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Eric Dumazet
On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:

> Does this mean that a user program that does a zerocopy send can cause
> a retransmitted segment to contain different data than the original
> segment?  If so, is that okay?

Same remark applies to sendfile() already, or other zero copy modes
(vmsplice() + splice() )






Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Andy Lutomirski
On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
 wrote:
>
>> I can see this working if you have a special type of skb that
>> indicates that the data might be concurrently written and have all the
>> normal skb APIs (including, especially, anything that clones it) make
>> a copy first.
>
> Support for cloned skbs is required for TCP, both at tcp_transmit_skb
> and segmentation offload. Patch 4 especially adds reference counting
> of shared pages across clones and other sk_buff operations like
> pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
> on clones in specific datapaths like the above.

Does this mean that a user program that does a zerocopy send can cause
a retransmitted segment to contain different data than the original
segment?  If so, is that okay?


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Willem de Bruijn
>>> I can see this working if you have a special type of skb that
>>> indicates that the data might be concurrently written and have all the
>>> normal skb APIs (including, especially, anything that clones it) make
>>> a copy first.
>>
>> Support for cloned skbs is required for TCP, both at tcp_transmit_skb
>> and segmentation offload. Patch 4 especially adds reference counting
>> of shared pages across clones and other sk_buff operations like
>> pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
>> on clones in specific datapaths like the above.
>
> Does this mean that a user program that does a zerocopy send can cause
> a retransmitted segment to contain different data than the original
> segment? If so, is that okay?

That is possible, indeed. The bytestream at the receiver is then
likely undefined. Though integrity of the tcp receive stack should
not be affected. A valid question is whether the stack should protect
against such users. The pattern is reminiscent of evasion attacks. In
the limited case, privileged users already can generate this data
pattern, of course.


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Andy Lutomirski
On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
 wrote:
> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski  wrote:
>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>>  wrote:
>>> [CC += linux-...@vger.kernel.org]
>>>
>>> Hi Willem
>>>
>>
 On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
 creates skbuff fragments directly from these pages. On tx completion,
 it notifies the socket owner that it is safe to modify memory by
 queuing a completion notification onto the socket error queue.
>>
>> What happens if the user writes to the pages while it's not safe?
>>
>> How about if you're writing to an interface or a route that has crypto
>> involved and a malicious user can make the data change in the middle
>> of a crypto operation, thus perhaps leaking the entire key?  (I
>> wouldn't be at all surprised if a lot of provably secure AEAD
>> constructions are entirely compromised if an attacker can get the
>> ciphertext and tag computed from a message that changed during the
>> computation.)
>
> Operations that read or write payload, such as this crypto example,
> but also ebpf in tc or iptables, for instance, demand a deep copy using
> skb_copy_ubufs before the operation.
>
> This blacklist approach requires caution, but these paths should be
> few and countable. It is not possible to predict at the socket layer
> whether a packet will encounter any such operation, so white-listing
> a subset of end-to-end paths is not practical.

How about hardware that malfunctions if the packet changes out from
under it?  A whitelist seems quite a bit safer.

--Andy


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Willem de Bruijn
On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski  wrote:
> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>  wrote:
>> [CC += linux-...@vger.kernel.org]
>>
>> Hi Willem
>>
>
>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>> creates skbuff fragments directly from these pages. On tx completion,
>>> it notifies the socket owner that it is safe to modify memory by
>>> queuing a completion notification onto the socket error queue.
>
> What happens if the user writes to the pages while it's not safe?
>
> How about if you're writing to an interface or a route that has crypto
> involved and a malicious user can make the data change in the middle
> of a crypto operation, thus perhaps leaking the entire key?  (I
> wouldn't be at all surprised if a lot of provably secure AEAD
> constructions are entirely compromised if an attacker can get the
> ciphertext and tag computed from a message that changed during the
> computation.)

Operations that read or write payload, such as this crypto example,
but also ebpf in tc or iptables, for instance, demand a deep copy using
skb_copy_ubufs before the operation.

This blacklist approach requires caution, but these paths should be
few and countable. It is not possible to predict at the socket layer
whether a packet will encounter any such operation, so white-listing
a subset of end-to-end paths is not practical.

> I can see this working if you have a special type of skb that
> indicates that the data might be concurrently written and have all the
> normal skb APIs (including, especially, anything that clones it) make
> a copy first.

Support for cloned skbs is required for TCP, both at tcp_transmit_skb
and segmentation offload. Patch 4 especially adds reference counting
of shared pages across clones and other sk_buff operations like
pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
on clones in specific datapaths like the above.


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Andy Lutomirski
On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
 wrote:
> [CC += linux-...@vger.kernel.org]
>
> Hi Willem
>

>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>> creates skbuff fragments directly from these pages. On tx completion,
>> it notifies the socket owner that it is safe to modify memory by
>> queuing a completion notification onto the socket error queue.

What happens if the user writes to the pages while it's not safe?

How about if you're writing to an interface or a route that has crypto
involved and a malicious user can make the data change in the middle
of a crypto operation, thus perhaps leaking the entire key?  (I
wouldn't be at all surprised if a lot of provably secure AEAD
constructions are entirely compromised if an attacker can get the
ciphertext and tag computed from a message that changed during the
computation.)

I can see this working if you have a special type of skb that
indicates that the data might be concurrently written and have all the
normal skb APIs (including, especially, anything that clones it) make
a copy first.

--Andy


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-27 Thread Michael Kerrisk
[CC += linux-...@vger.kernel.org]

Hi Willem

This is a change to the kernel-user-space API. Please CC
linux-...@vger.kernel.org on any future iterations of this patch.

Thanks,

Michael



On Wed, Feb 22, 2017 at 5:38 PM, Willem de Bruijn
 wrote:
> From: Willem de Bruijn 
>
> RFCv2:
>
> I have received a few requests for status and rebased code of this
> feature. We have been running this code internally, discovering and
> fixing various bugs. With net-next closed, now seems like a good time
> to share an updated patchset with fixes. The rebase from RFCv1/v4.2
> was mostly straightforward: mainly iov_iter changes. Full changelog:
>
>   RFC -> RFCv2:
> - review comment: do not loop skb with zerocopy frags onto rx:
>   add skb_orphan_frags_rx to orphan even refcounted frags
>   call this in __netif_receive_skb_core, deliver_skb and tun:
>   the same as 1080e512d44d ("net: orphan frags on receive")
> - fix: hold an explicit sk reference on each notification skb.
>   previously relied on the reference (or wmem) held by the
>   data skb that would trigger notification, but this breaks
>   on skb_orphan.
> - fix: when aborting a send, do not inc the zerocopy counter
>   this caused gaps in the notification chain
> - fix: in packet with SOCK_DGRAM, pull ll headers before calling
>   zerocopy_sg_from_iter
> - fix: if sock_zerocopy_realloc does not allow coalescing,
>   do not fail, just allocate a new ubuf
> - fix: in tcp, check return value of second allocation attempt
> - chg: allocate notification skbs from optmem
>   to avoid affecting tcp write queue accounting (TSQ)
> - chg: limit #locked pages (ulimit) per user instead of per process
> - chg: grow notification ids from 16 to 32 bit
>   - pass range [lo, hi] through 32 bit fields ee_info and ee_data
> - chg: rebased to davem-net-next on top of v4.10-rc7
> - add: limit notification coalescing
>   sharing ubufs limits overhead, but delays notification until
>   the last packet is released, possibly unbounded. Add a cap.
> - tests: add snd_zerocopy_lo pf_packet test
> - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)
>
> The change to allocate notification skbuffs from optmem requires
> ensuring that net.core.optmem is at least a few 100KB. To
> experiment, run
>
>   sysctl -w net.core.optmem_max=1048576
>
> The snd_zerocopy_lo benchmarks reported in the individual patches were
> rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were
> replaced with skb_orphan_frags to allow looping to local sockets. The
> netperf results below are also rerun with v2.
>
> In application load, copy avoidance shows a roughly 5% systemwide
> reduction in cycles when streaming large flows and a 4-8% reduction in
> wall clock time on early tensorflow test workloads.
>
>
> Overview (from original RFC):
>
> Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
> Implement the feature for TCP, UDP, RAW and packet sockets. This is
> a generalization of a previous packet socket RFC patch
>
>   http://patchwork.ozlabs.org/patch/413184/
>
> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
> creates skbuff fragments directly from these pages. On tx completion,
> it notifies the socket owner that it is safe to modify memory by
> queuing a completion notification onto the socket error queue.
>
> The kernel already implements such copy avoidance with vmsplice plus
> splice and with ubuf_info for tun and virtio. Extend the second
> with features required by TCP and others: reference counting to
> support cloning (retransmit queue) and shared fragments (GSO) and
> notification coalescing to handle corking.
>
> Notifications are queued onto the socket error queue as a range
> [N, N+m], where N is a per-socket counter incremented on each
> successful zerocopy send call.
>
> * Performance
>
> The below table shows cycles reported by perf for a netperf process
> sending a single 10 Gbps TCP_STREAM. The first three columns show
> Mcycles spent in the netperf process context. The second three columns
> show time spent systemwide (-a -C A,B) on the two cpus that run the
> process and interrupt handler. Reported is the median of at least 3
> runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
> Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs
> are disabled and the kernel is booted with idle=halt.
>
> NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size
>
> perf stat -e cycles $NETPERF
> perf stat -C 2,3 -a -e cycles $NETPERF
>
>        --- process cycles ---    ----- cpu cycles -----
>           std      zc    %         std      zc    %
> 4K     27,609  11,217   41      49,217  39,175   79
> 16K    21,370   3,823   18      43,540  29,213   67
> 64K    20,557   2,312   11      42,189  26,910   64

Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-24 Thread Willem de Bruijn
On Fri, Feb 24, 2017 at 6:03 PM, Alexei Starovoitov
 wrote:
> On Wed, Feb 22, 2017 at 11:38:49AM -0500, Willem de Bruijn wrote:
>>
>> * Limitations / Known Issues
>>
>> - PF_INET6 is not yet supported.
>
> we struggled so far to make it work in our setups which are ipv6 only.
> Looking at patches it seems the code should just work.
> What particularly is missing?
>
> Great stuff. Looking forward to net-next reopening :)

Thanks for taking the feature for a spin!

The udp and raw paths require separate ipv6 patches. TCP should indeed
just work. I just ran a slightly modified snd_zerocopy_lo with good
results as well as a hacked netperf to another host.

I should have had ipv6 from the start, of course. Will add it before
resubmitting when net-next opens. For now, quick hack to
snd_zerocopy_lo.c:

diff --git a/tools/testing/selftests/net/snd_zerocopy_lo.c
b/tools/testing/selftests/net/snd_zerocopy_lo.c
index 309b016a4fd5..38a165e2af64 100644
--- a/tools/testing/selftests/net/snd_zerocopy_lo.c
+++ b/tools/testing/selftests/net/snd_zerocopy_lo.c
@@ -453,7 +453,7 @@ static int do_setup_rx(int domain, int type, int protocol)

 static void do_setup_and_run(int domain, int type, int protocol)
 {
-   struct sockaddr_in addr;
+   struct sockaddr_in6 addr;
socklen_t alen;
int fdr, fdt, ret;

@@ -468,8 +468,8 @@ static void do_setup_and_run(int domain, int type,
int protocol)

if (domain != PF_PACKET) {
memset(&addr, 0, sizeof(addr));
-   addr.sin_family = AF_INET;
-   addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+   addr.sin6_family = AF_INET6;
+   addr.sin6_addr = in6addr_loopback;
alen = sizeof(addr);

if (bind(fdr, (void *) &addr, sizeof(addr)))
@@ -589,7 +589,7 @@ int main(int argc, char **argv)
if (cfg_test_raw_hdrincl)
do_setup_and_run(PF_INET, SOCK_RAW, IPPROTO_RAW);
if (cfg_test_tcp)
-   do_setup_and_run(PF_INET, SOCK_STREAM, 0);
+   do_setup_and_run(PF_INET6, SOCK_STREAM, 0);


Loopback zerocopy is disabled in RFCv2, so to use snd_zerocopy_lo to verify the
feature requires this hack in skb_orphan_frags_rx:

 static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
 {
-   if (likely(!skb_zcopy(skb)))
-   return 0;
-   return skb_copy_ubufs(skb, gfp_mask);
+   return skb_orphan_frags(skb, gfp_mask);
 }

With this change, I see

$ ./snd_zerocopy_lo_ipv6 -t
test socket(10, 1, 0)
cpu: 23
rx=106364 (6637 MB) tx=106364 txc=0
rx=209736 (13088 MB) tx=209736 txc=0
rx=314524 (19627 MB) tx=314524 txc=0
rx=419424 (26174 MB) tx=419424 txc=0
OK. All tests passed

$ ./snd_zerocopy_lo_ipv6 -t -z
test socket(10, 1, 0)
cpu: 23
rx=239792 (14964 MB) tx=239792 txc=239786
rx=477376 (29790 MB) tx=477376 txc=477370
rx=715016 (44620 MB) tx=715016 txc=715010
rx=952820 (59460 MB) tx=952820 txc=952814
OK. All tests passed

In comparison, the same without the change

$ ./snd_zerocopy_lo_ipv6 -t
test socket(10, 1, 0)
cpu: 23
rx=109908 (6858 MB) tx=109908 txc=0
rx=217100 (13548 MB) tx=217100 txc=0
rx=326584 (20380 MB) tx=326584 txc=0
rx=429568 (26807 MB) tx=429568 txc=0
OK. All tests passed

$ ./snd_zerocopy_lo_ipv6 -t  -z
test socket(10, 1, 0)
cpu: 23
rx=87636 (5468 MB) tx=87636 txc=87630
rx=174328 (10878 MB) tx=174328 txc=174322
rx=260360 (16247 MB) tx=260360 txc=260354
rx=346512 (21623 MB) tx=346512 txc=346506

Here the sk_buff hits the deep copy in skb_copy_ubufs called from
__netif_receive_skb_core, which actually degrades performance versus
copying as part of the sendmsg() syscall.

The netperf change is to add MSG_ZEROCOPY to send() in send_tcp_stream
and also adding a recvmsg(send_socket, , MSG_ERRQUEUE) to the same
function, preferably called only once every N iterations. This does
not take any additional explicit references on the send_ring element
while data is in flight, so is really a hack, but ring contents should
be static throughout the test. I did not modify the omni tests, so
this requires building with --no-omni.


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-24 Thread Alexei Starovoitov
On Wed, Feb 22, 2017 at 11:38:49AM -0500, Willem de Bruijn wrote:
> 
> * Limitations / Known Issues
> 
> - PF_INET6 is not yet supported.

we struggled so far to make it work in our setups which are ipv6 only.
Looking at patches it seems the code should just work.
What particularly is missing?

Great stuff. Looking forward to net-next reopening :)



Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-23 Thread David Miller
From: Willem de Bruijn 
Date: Wed, 22 Feb 2017 11:38:49 -0500

> From: Willem de Bruijn 
> 
> RFCv2:
> 
> I have received a few requests for status and rebased code of this
> feature. We have been running this code internally, discovering and
> fixing various bugs. With net-next closed, now seems like a good time
> to share an updated patchset with fixes. The rebase from RFCv1/v4.2
> was mostly straightforward: mainly iov_iter changes. Full changelog:

I've been over this series once or twice and generally speaking it looks
fine to me.


[PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

RFCv2:

I have received a few requests for status and rebased code of this
feature. We have been running this code internally, discovering and
fixing various bugs. With net-next closed, now seems like a good time
to share an updated patchset with fixes. The rebase from RFCv1/v4.2
was mostly straightforward: mainly iov_iter changes. Full changelog:

  RFC -> RFCv2:
- review comment: do not loop skb with zerocopy frags onto rx:
  add skb_orphan_frags_rx to orphan even refcounted frags
  call this in __netif_receive_skb_core, deliver_skb and tun:
  the same as 1080e512d44d ("net: orphan frags on receive")
- fix: hold an explicit sk reference on each notification skb.
  previously relied on the reference (or wmem) held by the
  data skb that would trigger notification, but this breaks
  on skb_orphan.
- fix: when aborting a send, do not inc the zerocopy counter
  this caused gaps in the notification chain
- fix: in packet with SOCK_DGRAM, pull ll headers before calling
  zerocopy_sg_from_iter
- fix: if sock_zerocopy_realloc does not allow coalescing,
  do not fail, just allocate a new ubuf
- fix: in tcp, check return value of second allocation attempt
- chg: allocate notification skbs from optmem
  to avoid affecting tcp write queue accounting (TSQ)
- chg: limit #locked pages (ulimit) per user instead of per process
- chg: grow notification ids from 16 to 32 bit
  - pass range [lo, hi] through 32 bit fields ee_info and ee_data
- chg: rebased to davem-net-next on top of v4.10-rc7
- add: limit notification coalescing
  sharing ubufs limits overhead, but delays notification until
  the last packet is released, possibly unbounded. Add a cap. 
- tests: add snd_zerocopy_lo pf_packet test
- tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)

The change to allocate notification skbuffs from optmem requires
ensuring that net.core.optmem_max is at least a few hundred KB. To
experiment, run

  sysctl -w net.core.optmem_max=1048576

The snd_zerocopy_lo benchmarks reported in the individual patches were
rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were
replaced with skb_orphan_frags to allow looping to local sockets. The
netperf results below are also rerun with v2.

Under application load, copy avoidance shows a roughly 5% systemwide
reduction in cycles when streaming large flows and a 4-8% reduction in
wall clock time on early tensorflow test workloads.


Overview (from original RFC):

Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
Implement the feature for TCP, UDP, RAW and packet sockets. This is
a generalization of a previous packet socket RFC patch

  http://patchwork.ozlabs.org/patch/413184/

On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
creates skbuff fragments directly from these pages. On tx completion,
it notifies the socket owner that it is safe to modify memory by
queuing a completion notification onto the socket error queue.

The kernel already implements such copy avoidance with vmsplice plus
splice and with ubuf_info for tun and virtio. Extend the latter
with features required by TCP and others: reference counting to
support cloning (retransmit queue) and shared fragments (GSO), and
notification coalescing to handle corking.

Notifications are queued onto the socket error queue as a range
[N, N+m], where N is a per-socket counter incremented on each
successful zerocopy send call.

* Performance

The table below shows cycles reported by perf for a netperf process
sending a single 10 Gbps TCP_STREAM. The first three columns show
Mcycles spent in the netperf process context. The next three columns
show Mcycles spent systemwide (-a -C A,B) on the two cpus that run the
process and the interrupt handler. Reported is the median of at least 3
runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
Netperf is pinned to cpu 2, network interrupts to cpu 3, rps and rfs
are disabled and the kernel is booted with idle=halt.

NETPERF="./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size"

perf stat -e cycles $NETPERF
perf stat -C 2,3 -a -e cycles $NETPERF

          --process cycles--      -----cpu cycles-----
             std      zc    %        std      zc    %
  4K      27,609  11,217   41     49,217  39,175   79
  16K     21,370   3,823   18     43,540  29,213   67
  64K     20,557   2,312   11     42,189  26,910   64
  256K    21,110   2,134   10     43,006  27,104   63
  1M      20,987   1,610    8     42,759  25,931   61

Perf record indicates the main source of these differences. Process
cycles only at 1M writes (perf record; perf report -n):

std:
Samples: 42K of event 'cycles', Event count (approx.): 21258597313  
 
 79.41% 33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
  3.27%  1396