Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-07 Thread Daniel Borkmann
On 03/07/2018 04:25 AM, John Fastabend wrote:
[...]
> Thought about this a bit more and chatted with Daniel a bit. I think
> a better solution is to set data_start = data_end = 0 by default in the
> sendpage case. This will disallow any read/writes into the sendpage
> data. Then if the user needs to read/write data we can use a helper
> bpf_sk_msg_pull_data(start_byte, end_byte) which can pull the data into a
> linear buffer as needed. This will ensure any user writes will not
> change data after the BPF verdict (your concern). Also it will minimize
> the amount of data that needs to be copied (my concern). In some of my
> use cases where no data is needed we can simple not use the helper. Then
> on the sendmsg side we can continue to set the (data_start, data_end)
> pointers to the first scatterlist element. But, also use this helper to
> set the data pointers past the first scatterlist element if needed. So
> if someone wants to read past the first 4k bytes on a large send for
> example this can be done with the above helper. BPF programs just
> need to check (start,end) data pointers and can be oblivious to
> if the program is being invoked by a call from sendpage or sendmsg.
> 
> I think this is a fairly elegant solution. Finally we can further
> optimize later with a flag if needed to cover the case where we
> want to read lots of bytes but _not_ do the copy. We can debate
> the usefulness of this later with actual perf data.
> 
> All this has the added bonus that all I need is another patch on
> top to add the helper. Pseudo code might look like this,
> 
> my_bpf_prog(struct sk_msg_md *msg) {
>   void *data_end = msg->data_end;
>   void *data_start = msg->data_start;
> 
>   need = PARSE_BYTES;
> 
>   // ensure user actually sent full header
>   if (msg->size < PARSE_BYTES) {
>   bpf_msg_cork(PARSE_BYTES);
>   return SK_DROP;
>   }
> 
>   /* ensure we can read full header, if this is a
>* sendmsg system call AND PARSE_BYTES are all in
>* the first scatterlist elem this is a no-op.
>* If this is a sendpage call will put PARSE_BYTES
>* in a psock buffer to avoid user modifications.
>*/
>   if (data_end - data_start < PARSE_BYTES) {

I think it might need to look like 'data_start + PARSE_BYTES > data_end'
for verifier to recognize (unless LLVM generates code that way).

>   err = bpf_sk_msg_pull_data(0, PARSE_BYTES, flags);
>   if (err)
>   return SK_DROP;

Above should be:

if (unlikely(err || data_start + PARSE_BYTES > data_end))
return SK_DROP;

Here for the successful case, you need to recheck since data pointers
were invalidated due to the helper call. bpf_sk_msg_pull_data() would
for the very first case potentially be called unconditionally at prog
start though when you start out with 0 len anyway, basically right after
msg->size test.

>   }
> 
>   // we have the full header parse it now
>   verdict = my_bpf_header_parser(msg);
>   return verdict;
> }
> 
> Future optimization can work with prologue to pull in bytes
> more efficiently. And for what its worth I found a couple bugs
> in the error path of the sendpage hook I can fix in the v2 as well.
> 
> What do you think? 
> 
> @Daniel, sound more or less like what you were thinking?

Yes, absolutely what I was thinking.

We have exactly the same logic in tc/BPF today for the case when the direct
packet access test fails and we want to pull in skb data from non-linear
area, so we can in such case just call bpf_skb_pull_data(skb, len) and redo
the test to access it privately after that.

Thanks,
Daniel


Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-06 Thread David Miller
From: John Fastabend 
Date: Tue, 6 Mar 2018 19:25:01 -0800

> What do you think? 

Sounds good from your description, I can't wait to see it :-)


Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-06 Thread John Fastabend
On 03/06/2018 10:18 AM, John Fastabend wrote:
> On 03/06/2018 07:47 AM, David Miller wrote:
>> From: John Fastabend 
>> Date: Mon, 5 Mar 2018 23:06:01 -0800
>>
>>> On 03/05/2018 10:42 PM, David Miller wrote:
 From: John Fastabend 
 Date: Mon, 5 Mar 2018 22:22:21 -0800

> All I meant by this is if an application uses sendfile() call
> there is no good way to know when/if the kernel side will copy or
> xmit the  data. So a reliable user space application will need to
> only modify the data if it "knows" there are no outstanding sends
> in-flight. So if we assume applications follow this then it
> is OK to avoid the copy. Of course this is not good enough for
> security, but for monitoring/statistics (my use case 1 it works).

 For an application implementing a networking file system, it's pretty
 legitimate for file contents to change before the page gets DMA's to
 the networking card.

>>>
>>> Still there are useful BPF programs that can tolerate this. So I
>>> would prefer to allow BPF programs to operate in the no-copy mode
>>> if wanted. It doesn't have to be the default though as it currently
>>> is. A l7 load balancer is a good example of this.
>>
>> Maybe I'd be ok if it were not the default.  But do you really want to
>> expose a potential attack vector, even if the app gets to choose and
>> say "I'm ok"?
>>
> 
> Yes, because I have use cases where I don't need to read the data, but
> have already "approved" the data. One example applications like
> nginx can serve static http data. Just reading over the code what they
> do, when sendfile is enabled, is a sendmsg call with the header. We want
> to enforce the policy on the header. Then we know the next N bytes are
> OK. Nginx will then send the payload over sendfile syscall. We already
> know the data is good from initial sendmsg call the next N bytes can
> get the verdict SK_PASS without even touching the data. If we do a
> copy in this case we see significant performance degradation.
> 
> The other use case is the L7 load balancer mentioned above. If we are
> using RR policies or some other heuristic if the user modifies the
> payload after the BPF verdict that is also fine. A malicious user
> could rewrite the header and try to game the load balancer but the
> BPF program can always just dev/null (SK_DROP) the application when
> it detects this. This also assumes the load balancer is using the
> header for its heuristic some interesting heuristics may not use
> the header at all.
> 
 And that's perfectly fine, and we everything such that this will work
 properly.

 The card checksums what ends up being DMA'd so nothing from the
 networking side is broken.
>>>
>>> Assuming the card has checksum support correct? Which is why we have
>>> the SKBTX_SHARED_FRAG checked in skb_has_shared_frag() and the checksum
>>> helpers called by the drivers when they do not support the protocol
>>> being used. So probably OK assumption if using supported protocols and
>>> hardware? Perhaps in general folks just use normal protocols and
>>> hardware so it works.
>>
>> If the hardware doesn't support the checksums, we linearize the SKB
>> (therefore obtain a snapshot of the data), and checksum.  Exactly what
>> would happen if the hardware did the checksum.
>>
>> So OK in that case too.
>>
>> We always guarantee that you will always get a correct checksum on
>> outgoing packets, even if you modify the page contents meanwhile.
>>
> 
> Agreed the checksum is correct, but the user doesn't know if the linearize
> happened while it was modifying the data, potentially creating data with
> a partial update. Because the user modifying the data doesn't block the
> linearize operation in the kernel and vice versa the linearize operation
> can happen in parallel with the user side data modification. So maybe
> I'm still missing something but it seems the data can be in some unknown
> state on the wire.
> 
> Either way though I think its fine to make the default sendpage hook do
> the copy. A flag to avoid the copy can be added later to resolve my use
> cases above. I'll code this up in a v2 today/tomorrow.

Hi,

Thought about this a bit more and chatted with Daniel a bit. I think
a better solution is to set data_start = data_end = 0 by default in the
sendpage case. This will disallow any read/writes into the sendpage
data. Then if the user needs to read/write data we can use a helper
bpf_sk_msg_pull_data(start_byte, end_byte) which can pull the data into a
linear buffer as needed. This will ensure any user writes will not
change data after the BPF verdict (your concern). Also it will minimize
the amount of data that needs to be copied (my concern). In some of my
use cases where no data is needed we can simple not use the helper. Then
on the sendmsg side we can continue to set the (data_start, data_end)
pointers to the first scatterlist element. 

Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-06 Thread John Fastabend
On 03/06/2018 07:47 AM, David Miller wrote:
> From: John Fastabend 
> Date: Mon, 5 Mar 2018 23:06:01 -0800
> 
>> On 03/05/2018 10:42 PM, David Miller wrote:
>>> From: John Fastabend 
>>> Date: Mon, 5 Mar 2018 22:22:21 -0800
>>>
 All I meant by this is if an application uses sendfile() call
 there is no good way to know when/if the kernel side will copy or
 xmit the  data. So a reliable user space application will need to
 only modify the data if it "knows" there are no outstanding sends
 in-flight. So if we assume applications follow this then it
 is OK to avoid the copy. Of course this is not good enough for
 security, but for monitoring/statistics (my use case 1 it works).
>>>
>>> For an application implementing a networking file system, it's pretty
>>> legitimate for file contents to change before the page gets DMA's to
>>> the networking card.
>>>
>>
>> Still there are useful BPF programs that can tolerate this. So I
>> would prefer to allow BPF programs to operate in the no-copy mode
>> if wanted. It doesn't have to be the default though as it currently
>> is. A l7 load balancer is a good example of this.
> 
> Maybe I'd be ok if it were not the default.  But do you really want to
> expose a potential attack vector, even if the app gets to choose and
> say "I'm ok"?
> 

Yes, because I have use cases where I don't need to read the data, but
have already "approved" the data. One example applications like
nginx can serve static http data. Just reading over the code what they
do, when sendfile is enabled, is a sendmsg call with the header. We want
to enforce the policy on the header. Then we know the next N bytes are
OK. Nginx will then send the payload over sendfile syscall. We already
know the data is good from initial sendmsg call the next N bytes can
get the verdict SK_PASS without even touching the data. If we do a
copy in this case we see significant performance degradation.

The other use case is the L7 load balancer mentioned above. If we are
using RR policies or some other heuristic if the user modifies the
payload after the BPF verdict that is also fine. A malicious user
could rewrite the header and try to game the load balancer but the
BPF program can always just dev/null (SK_DROP) the application when
it detects this. This also assumes the load balancer is using the
header for its heuristic some interesting heuristics may not use
the header at all.

>>> And that's perfectly fine, and we everything such that this will work
>>> properly.
>>>
>>> The card checksums what ends up being DMA'd so nothing from the
>>> networking side is broken.
>>
>> Assuming the card has checksum support correct? Which is why we have
>> the SKBTX_SHARED_FRAG checked in skb_has_shared_frag() and the checksum
>> helpers called by the drivers when they do not support the protocol
>> being used. So probably OK assumption if using supported protocols and
>> hardware? Perhaps in general folks just use normal protocols and
>> hardware so it works.
> 
> If the hardware doesn't support the checksums, we linearize the SKB
> (therefore obtain a snapshot of the data), and checksum.  Exactly what
> would happen if the hardware did the checksum.
> 
> So OK in that case too.
> 
> We always guarantee that you will always get a correct checksum on
> outgoing packets, even if you modify the page contents meanwhile.
> 

Agreed the checksum is correct, but the user doesn't know if the linearize
happened while it was modifying the data, potentially creating data with
a partial update. Because the user modifying the data doesn't block the
linearize operation in the kernel and vice versa the linearize operation
can happen in parallel with the user side data modification. So maybe
I'm still missing something but it seems the data can be in some unknown
state on the wire.

Either way though I think its fine to make the default sendpage hook do
the copy. A flag to avoid the copy can be added later to resolve my use
cases above. I'll code this up in a v2 today/tomorrow.

>> So the "I need at least X more bytes" is the msg_cork_bytes() in patch
>> 7. I could handle the sendpage case the same as I handle the sendmsg
>> case and copy the data into the buffer until N bytes are received. I
>> had planned to add this mode in a follow up series but could add it in
>> this series so we have all the pieces in one submission.
>>
>> Although I used a scatterlist instead of a linear buffer. I was
>> planning to add a helper to pull in next sg list item if needed
>> rather than try to allocate a large linear block up front.
> 
> For non-deep packet inspection cases this re-running of the parser case
> will probably not trigger at all.
> 

Agreed, its mostly there to handle cases where the sendmsg call
only sent part of a application (kafka, http, etc) header. This can
happen if user is sending multiple messages in a single sendmsg/sendfile
call. But, yeah I see it rarely in 

Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-06 Thread David Miller
From: John Fastabend 
Date: Mon, 5 Mar 2018 23:06:01 -0800

> On 03/05/2018 10:42 PM, David Miller wrote:
>> From: John Fastabend 
>> Date: Mon, 5 Mar 2018 22:22:21 -0800
>> 
>>> All I meant by this is if an application uses sendfile() call
>>> there is no good way to know when/if the kernel side will copy or
>>> xmit the  data. So a reliable user space application will need to
>>> only modify the data if it "knows" there are no outstanding sends
>>> in-flight. So if we assume applications follow this then it
>>> is OK to avoid the copy. Of course this is not good enough for
>>> security, but for monitoring/statistics (my use case 1 it works).
>> 
>> For an application implementing a networking file system, it's pretty
>> legitimate for file contents to change before the page gets DMA's to
>> the networking card.
>> 
> 
> Still there are useful BPF programs that can tolerate this. So I
> would prefer to allow BPF programs to operate in the no-copy mode
> if wanted. It doesn't have to be the default though as it currently
> is. A l7 load balancer is a good example of this.

Maybe I'd be ok if it were not the default.  But do you really want to
expose a potential attack vector, even if the app gets to choose and
say "I'm ok"?

>> And that's perfectly fine, and we everything such that this will work
>> properly.
>> 
>> The card checksums what ends up being DMA'd so nothing from the
>> networking side is broken.
> 
> Assuming the card has checksum support correct? Which is why we have
> the SKBTX_SHARED_FRAG checked in skb_has_shared_frag() and the checksum
> helpers called by the drivers when they do not support the protocol
> being used. So probably OK assumption if using supported protocols and
> hardware? Perhaps in general folks just use normal protocols and
> hardware so it works.

If the hardware doesn't support the checksums, we linearize the SKB
(therefore obtain a snapshot of the data), and checksum.  Exactly what
would happen if the hardware did the checksum.

So OK in that case too.

We always guarantee that you will always get a correct checksum on
outgoing packets, even if you modify the page contents meanwhile.

> So the "I need at least X more bytes" is the msg_cork_bytes() in patch
> 7. I could handle the sendpage case the same as I handle the sendmsg
> case and copy the data into the buffer until N bytes are received. I
> had planned to add this mode in a follow up series but could add it in
> this series so we have all the pieces in one submission.
> 
> Although I used a scatterlist instead of a linear buffer. I was
> planning to add a helper to pull in next sg list item if needed
> rather than try to allocate a large linear block up front.

For non-deep packet inspection cases this re-running of the parser case
will probably not trigger at all.


Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-05 Thread John Fastabend
On 03/05/2018 10:42 PM, David Miller wrote:
> From: John Fastabend 
> Date: Mon, 5 Mar 2018 22:22:21 -0800
> 
>> All I meant by this is if an application uses sendfile() call
>> there is no good way to know when/if the kernel side will copy or
>> xmit the  data. So a reliable user space application will need to
>> only modify the data if it "knows" there are no outstanding sends
>> in-flight. So if we assume applications follow this then it
>> is OK to avoid the copy. Of course this is not good enough for
>> security, but for monitoring/statistics (my use case 1 it works).
> 
> For an application implementing a networking file system, it's pretty
> legitimate for file contents to change before the page gets DMA's to
> the networking card.
> 

Still there are useful BPF programs that can tolerate this. So I
would prefer to allow BPF programs to operate in the no-copy mode
if wanted. It doesn't have to be the default though as it currently
is. A l7 load balancer is a good example of this.

> And that's perfectly fine, and we everything such that this will work
> properly.
> 
> The card checksums what ends up being DMA'd so nothing from the
> networking side is broken.

Assuming the card has checksum support correct? Which is why we have
the SKBTX_SHARED_FRAG checked in skb_has_shared_frag() and the checksum
helpers called by the drivers when they do not support the protocol
being used. So probably OK assumption if using supported protocols and
hardware? Perhaps in general folks just use normal protocols and
hardware so it works.

> 
> So this assumption you mention really does not hold.
> 

OK.

> There needs to be some feedback from the BPF program that parses the
> packet.  This way it can say, "I need at least X more bytes before I
> can generate a verdict".  And you keep copying more and more bytes
> into a linear buffer and calling the parser over and over until it can
> generate a full verdict or you run out of networking data.
> 

So the "I need at least X more bytes" is the msg_cork_bytes() in patch
7. I could handle the sendpage case the same as I handle the sendmsg
case and copy the data into the buffer until N bytes are received. I
had planned to add this mode in a follow up series but could add it in
this series so we have all the pieces in one submission.

Although I used a scatterlist instead of a linear buffer. I was
planning to add a helper to pull in next sg list item if needed
rather than try to allocate a large linear block up front.


Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-05 Thread David Miller
From: John Fastabend 
Date: Mon, 5 Mar 2018 22:22:21 -0800

> All I meant by this is if an application uses sendfile() call
> there is no good way to know when/if the kernel side will copy or
> xmit the  data. So a reliable user space application will need to
> only modify the data if it "knows" there are no outstanding sends
> in-flight. So if we assume applications follow this then it
> is OK to avoid the copy. Of course this is not good enough for
> security, but for monitoring/statistics (my use case 1 it works).

For an application implementing a networking file system, it's pretty
legitimate for file contents to change before the page gets DMA's to
the networking card.

And that's perfectly fine, and we everything such that this will work
properly.

The card checksums what ends up being DMA'd so nothing from the
networking side is broken.

So this assumption you mention really does not hold.

There needs to be some feedback from the BPF program that parses the
packet.  This way it can say, "I need at least X more bytes before I
can generate a verdict".  And you keep copying more and more bytes
into a linear buffer and calling the parser over and over until it can
generate a full verdict or you run out of networking data.


Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-05 Thread John Fastabend
On 03/05/2018 09:42 PM, David Miller wrote:
> From: John Fastabend 
> Date: Mon, 5 Mar 2018 14:53:08 -0800
> 
>> I decided to make the default no-copy to mirror the existing
>> sendpage() semantics and then to add the flag later. The flag
>> support is not in this series simply because I wanted to get the
>> base support in first.
> 
> What existing sendpage semantics are you referring to?
> 

All I meant by this is if an application uses sendfile() call
there is no good way to know when/if the kernel side will copy or
xmit the  data. So a reliable user space application will need to
only modify the data if it "knows" there are no outstanding sends
in-flight. So if we assume applications follow this then it
is OK to avoid the copy. Of course this is not good enough for
security, but for monitoring/statistics (my use case 1 it works).

By keep existing sendpage semantics I just meant applications
should already follow the above.


Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-05 Thread David Miller
From: John Fastabend 
Date: Mon, 5 Mar 2018 14:53:08 -0800

> I decided to make the default no-copy to mirror the existing
> sendpage() semantics and then to add the flag later. The flag
> support is not in this series simply because I wanted to get the
> base support in first.

What existing sendpage semantics are you referring to?


Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-05 Thread John Fastabend
On 03/05/2018 01:40 PM, David Miller wrote:
> From: John Fastabend 
> Date: Mon, 05 Mar 2018 11:51:22 -0800
> 
>> BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
>> SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
>> case and in the sendpage case leaves the data untouched. Both cases
>> return -EACESS to the user. Returning SK_PASS will allow the msg to
>> be sent.
>>
>> In the sendmsg case data is copied into kernel space buffers before
>> running the BPF program. In the sendpage case data is never copied.
>> The implication being users may change data after BPF programs run in
>> the sendpage case. (A flag will be added to always copy shortly
>> if the copy must always be performed).
> 
> I don't see how the sendpage case can be right.
> 
> The user can asynchronously change the page contents whenever they
> want, and if the BPF program runs on the old contents then the verdict
> is not for what actually ends up being sent on the socket> 
> There is really no way to cheaply freeze the page contents other than
> to make a copy.
> 

Right, so we have two cases. The first is we are not trying to protect
against malicious users but merely monitor the connection. This case
is primarily for L7 statistics, number of bytes sent to URL foo
for example. If users are changing data (for a real program not something
malicious) mid sendfile() this is really buggy anyways. There is no way to
know when/if the data is being copied lower in the stack. Even worse would
be if it changed a msg header, such as the http or kafka header, then
I don't see how such a program would work reliable at all. Some of my
L7 monitoring BPF programs fall into this category.

The second case is we want to implement a strict policy. For example
never allow user 'bar' to send to URL foo. In the current patches this
would be vulnerable to async data changes. I was planning to have a follow
up patch to this series to add a flag "always copy" which handles the
asynchronous case by always copying the data if the BPF policy can
not tolerate user changing data mid-send. Another class of BPF programs
I have fall into this bucket.

However, the performance cost of copy can be significant so allowing the
BPF policy to decide which mode they require seems best to me. I decided
to make the default no-copy to mirror the existing sendpage() semantics
and then to add the flag later. The flag support is not in this series
simply because I wanted to get the base support in first.

Make sense? The default could be to copy sendpage data and then a
flag could be made to allow it to skip the copy. But I prefer the
current defaults.

Thanks,
John



Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-05 Thread David Miller
From: John Fastabend 
Date: Mon, 05 Mar 2018 11:51:22 -0800

> BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
> SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
> case and in the sendpage case leaves the data untouched. Both cases
> return -EACESS to the user. Returning SK_PASS will allow the msg to
> be sent.
> 
> In the sendmsg case data is copied into kernel space buffers before
> running the BPF program. In the sendpage case data is never copied.
> The implication being users may change data after BPF programs run in
> the sendpage case. (A flag will be added to always copy shortly
> if the copy must always be performed).

I don't see how the sendpage case can be right.

The user can asynchronously change the page contents whenever they
want, and if the BPF program runs on the old contents then the verdict
is not for what actually ends up being sent on the socket.

There is really no way to cheaply freeze the page contents other than
to make a copy.