Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-16 Thread John Fastabend
On 03/15/2018 05:37 PM, Daniel Borkmann wrote:
> On 03/16/2018 12:06 AM, Alexei Starovoitov wrote:
>> On Thu, Mar 15, 2018 at 11:55:39PM +0100, Daniel Borkmann wrote:
>>> On 03/15/2018 11:20 PM, Alexei Starovoitov wrote:
>>>> On Thu, Mar 15, 2018 at 11:17:12PM +0100, Daniel Borkmann wrote:
>>>>> On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
>>>>>> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
>>>>>>>  
>>>>>>> +/* User return codes for SK_MSG prog type. */
>>>>>>> +enum sk_msg_action {
>>>>>>> +   SK_MSG_DROP = 0,
>>>>>>> +   SK_MSG_PASS,
>>>>>>> +};
>>>>>>
>>>>>> do we really need new enum here?
>>>>>> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
>>>>>> and there will be only drop/pass in both enums.
>>>>>> Also I don't see where these two new SK_MSG_* are used...
>>>>>>
>>>>>>> +
>>>>>>> +/* user accessible metadata for SK_MSG packet hook, new fields must
>>>>>>> + * be added to the end of this structure
>>>>>>> + */
>>>>>>> +struct sk_msg_md {
>>>>>>> +   __u32 data;
>>>>>>> +   __u32 data_end;
>>>>>>> +};
>>>>>>
>>>>>> I think it's time for me to ask for forgiveness :)
>>>>>
>>>>> :-)
>>>>>
>>>>>> I used __u32 for data and data_end only because all other fields
>>>>>> in __sk_buff were __u32 at the time and I couldn't easily figure out
>>>>>> how to teach verifier to recognize 8-byte rewrites.
>>>>>> Unfortunately my mistake stuck and was copied over into xdp.
>>>>>> Since this is new struct let's do it right and add
>>>>>> 'void *data, *data_end' here,
>>>>>> since bpf prog will use them as 'void *' pointers.
>>>>>> There are no compat issues here, since bpf is always 64-bit.
>>>>>
>>>>> But at least offset-wise when you do the ctx rewrite this would then
>>>>> be a bit more tricky when you have 64 bit kernel with 32 bit user
>>>>> space since void * members are in each cases at different offset. So
>>>>> unless I'm missing something, this still should either be __u32 or
>>>>> __u64 instead of void *, no?
>>>>
>>>> there is no 32-bit user space. these structs are seen by bpf progs only
>>>> and bpf is 64-bit only too.
>>>> unless I'm missing your point.
>>>
>>> Ok, so lets say you have 32 bit LLVM binary and compile the prog where
>>> you access md->data_end. Given the void * in the struct will that access
>>> end up being BPF_W at ctx offset 4 or BPF_DW at ctx offset 8 from clang
>>> perspective (iow, is the back end treating this special and always use
>>> fixed BPF_DW in such case)? If not and it would be the first case with
>>> offset 4, then we could have the case that underlying 64 bit kernel is
>>> expecting ctx offset 8 for doing the md ctx conversion.
>>
>> i'm still not quite following.
>> Whether llvm itself is 32-bit binary or it's arm32 or sparc32 binary
>> doesn't matter. It will produce the same 64-bit bpf code.
>> It will see 'void *' deref from this struct and will emit DW.
>> May be confusion is from newly added -mattr=+alu32 flag?
>> That option doesn't change that sizeof(void*)==8.
>> It only allows backend to emit 32-bit alu insns.
> 
> Ok, so conclusion we had is that while BPF target is unconditionally 64 bit,
> it depends which clang front end you use for compilation wrt structs. E.g.
> on 32 bit native (e.g. arm) clang front end it would compile the ctx void *
> pointers as 4 byte while using clang -target bpf it would compile it as 8
> byte. The native clang front end is needed in case of tracing when accessing
> pt_regs for walking data structures, but not for networking use case, so
> always using -target bpf there is proper way. Meaning there would be no
> confusion on the void * since size will always be 8 regardless of underlying
> arch being 32 or 64 bit or clang/llvm binary being 32 bit on 64 bit kernel.
> Thus, sticking to void * would be fine, but definitely samples/sockmap/Makefile
> must be fixed as well, such that people don't copy it wrongly.
> 
> Cheers,
> Daniel

I'll send a fix for sockmap/Makefile then as a separate series. And
go ahead and change this series to use 'void *'.

Thanks for the follow-up on this.
 





Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread Daniel Borkmann
On 03/16/2018 12:06 AM, Alexei Starovoitov wrote:
> On Thu, Mar 15, 2018 at 11:55:39PM +0100, Daniel Borkmann wrote:
>> On 03/15/2018 11:20 PM, Alexei Starovoitov wrote:
>>> On Thu, Mar 15, 2018 at 11:17:12PM +0100, Daniel Borkmann wrote:
>>>> On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
>>>>> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
>>>>>>  
>>>>>> +/* User return codes for SK_MSG prog type. */
>>>>>> +enum sk_msg_action {
>>>>>> +SK_MSG_DROP = 0,
>>>>>> +SK_MSG_PASS,
>>>>>> +};
>>>>>
>>>>> do we really need new enum here?
>>>>> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
>>>>> and there will be only drop/pass in both enums.
>>>>> Also I don't see where these two new SK_MSG_* are used...
>>>>>
>>>>>> +
>>>>>> +/* user accessible metadata for SK_MSG packet hook, new fields must
>>>>>> + * be added to the end of this structure
>>>>>> + */
>>>>>> +struct sk_msg_md {
>>>>>> +__u32 data;
>>>>>> +__u32 data_end;
>>>>>> +};
>>>>>
>>>>> I think it's time for me to ask for forgiveness :)
>>>>
>>>> :-)
>>>>
>>>>> I used __u32 for data and data_end only because all other fields
>>>>> in __sk_buff were __u32 at the time and I couldn't easily figure out
>>>>> how to teach verifier to recognize 8-byte rewrites.
>>>>> Unfortunately my mistake stuck and was copied over into xdp.
>>>>> Since this is new struct let's do it right and add
>>>>> 'void *data, *data_end' here,
>>>>> since bpf prog will use them as 'void *' pointers.
>>>>> There are no compat issues here, since bpf is always 64-bit.
>>>>
>>>> But at least offset-wise when you do the ctx rewrite this would then
>>>> be a bit more tricky when you have 64 bit kernel with 32 bit user
>>>> space since void * members are in each cases at different offset. So
>>>> unless I'm missing something, this still should either be __u32 or
>>>> __u64 instead of void *, no?
>>>
>>> there is no 32-bit user space. these structs are seen by bpf progs only
>>> and bpf is 64-bit only too.
>>> unless I'm missing your point.
>>
>> Ok, so lets say you have 32 bit LLVM binary and compile the prog where
>> you access md->data_end. Given the void * in the struct will that access
>> end up being BPF_W at ctx offset 4 or BPF_DW at ctx offset 8 from clang
>> perspective (iow, is the back end treating this special and always use
>> fixed BPF_DW in such case)? If not and it would be the first case with
>> offset 4, then we could have the case that underlying 64 bit kernel is
>> expecting ctx offset 8 for doing the md ctx conversion.
> 
> i'm still not quite following.
> Whether llvm itself is 32-bit binary or it's arm32 or sparc32 binary
> doesn't matter. It will produce the same 64-bit bpf code.
> It will see 'void *' deref from this struct and will emit DW.
> May be confusion is from newly added -mattr=+alu32 flag?
> That option doesn't change that sizeof(void*)==8.
> It only allows backend to emit 32-bit alu insns.

Ok, so conclusion we had is that while BPF target is unconditionally 64 bit,
it depends which clang front end you use for compilation wrt structs. E.g.
on 32 bit native (e.g. arm) clang front end it would compile the ctx void *
pointers as 4 byte while using clang -target bpf it would compile it as 8
byte. The native clang front end is needed in case of tracing when accessing
pt_regs for walking data structures, but not for networking use case, so
always using -target bpf there is proper way. Meaning there would be no
confusion on the void * since size will always be 8 regardless of underlying
arch being 32 or 64 bit or clang/llvm binary being 32 bit on 64 bit kernel.
Thus, sticking to void * would be fine, but definitely samples/sockmap/Makefile
must be fixed as well, such that people don't copy it wrongly.
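
To make the offsets concrete, here is a minimal user-space sketch. It is not
from the patch; struct md_u32, struct md_ptr and the two helper functions are
made-up names. Built natively for a 32-bit target, the pointer variant puts
data_end at offset 4, while the same struct built with clang -target bpf (or
natively on a 64-bit host) should put it at offset 8:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the v2 layout: two fixed 32-bit fields. */
struct md_u32 {
	uint32_t data;
	uint32_t data_end;
};

/* Hypothetical stand-in for the proposed layout: two pointers. */
struct md_ptr {
	void *data;
	void *data_end;
};

/* Fixed-width fields give the same offset on every target. */
static_assert(offsetof(struct md_u32, data_end) == 4,
	      "data_end sits at offset 4 with 32-bit fields");

/* Offset of data_end with fixed-width fields. */
size_t md_u32_data_end_off(void)
{
	return offsetof(struct md_u32, data_end);
}

/* Offset of data_end with pointer fields: follows sizeof(void *). */
size_t md_ptr_data_end_off(void)
{
	return offsetof(struct md_ptr, data_end);
}
```

So the only variable is which front end lays out the struct, not the width of
the BPF target itself.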

Cheers,
Daniel


Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread Alexei Starovoitov
On Thu, Mar 15, 2018 at 11:55:39PM +0100, Daniel Borkmann wrote:
> On 03/15/2018 11:20 PM, Alexei Starovoitov wrote:
> > On Thu, Mar 15, 2018 at 11:17:12PM +0100, Daniel Borkmann wrote:
> >> On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
> >>> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
> >>>>  
> >>>> +/* User return codes for SK_MSG prog type. */
> >>>> +enum sk_msg_action {
> >>>> +SK_MSG_DROP = 0,
> >>>> +SK_MSG_PASS,
> >>>> +};
> >>>
> >>> do we really need new enum here?
> >>> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
> >>> and there will be only drop/pass in both enums.
> >>> Also I don't see where these two new SK_MSG_* are used...
> >>>
> >>>> +
> >>>> +/* user accessible metadata for SK_MSG packet hook, new fields must
> >>>> + * be added to the end of this structure
> >>>> + */
> >>>> +struct sk_msg_md {
> >>>> +__u32 data;
> >>>> +__u32 data_end;
> >>>> +};
> >>>
> >>> I think it's time for me to ask for forgiveness :)
> >>
> >> :-)
> >>
> >>> I used __u32 for data and data_end only because all other fields
> >>> in __sk_buff were __u32 at the time and I couldn't easily figure out
> >>> how to teach verifier to recognize 8-byte rewrites.
> >>> Unfortunately my mistake stuck and was copied over into xdp.
> >>> Since this is new struct let's do it right and add
> >>> 'void *data, *data_end' here,
> >>> since bpf prog will use them as 'void *' pointers.
> >>> There are no compat issues here, since bpf is always 64-bit.
> >>
> >> But at least offset-wise when you do the ctx rewrite this would then
> >> be a bit more tricky when you have 64 bit kernel with 32 bit user
> >> space since void * members are in each cases at different offset. So
> >> unless I'm missing something, this still should either be __u32 or
> >> __u64 instead of void *, no?
> > 
> > there is no 32-bit user space. these structs are seen by bpf progs only
> > and bpf is 64-bit only too.
> > unless I'm missing your point.
> 
> Ok, so lets say you have 32 bit LLVM binary and compile the prog where
> you access md->data_end. Given the void * in the struct will that access
> end up being BPF_W at ctx offset 4 or BPF_DW at ctx offset 8 from clang
> perspective (iow, is the back end treating this special and always use
> fixed BPF_DW in such case)? If not and it would be the first case with
> offset 4, then we could have the case that underlying 64 bit kernel is
> expecting ctx offset 8 for doing the md ctx conversion.

i'm still not quite following.
Whether llvm itself is 32-bit binary or it's arm32 or sparc32 binary
doesn't matter. It will produce the same 64-bit bpf code.
It will see 'void *' deref from this struct and will emit DW.
May be confusion is from newly added -mattr=+alu32 flag?
That option doesn't change that sizeof(void*)==8.
It only allows backend to emit 32-bit alu insns.



Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread Daniel Borkmann
On 03/15/2018 11:20 PM, Alexei Starovoitov wrote:
> On Thu, Mar 15, 2018 at 11:17:12PM +0100, Daniel Borkmann wrote:
>> On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
>>> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
  
>>>> +/* User return codes for SK_MSG prog type. */
>>>> +enum sk_msg_action {
>>>> +  SK_MSG_DROP = 0,
>>>> +  SK_MSG_PASS,
>>>> +};
>>>
>>> do we really need new enum here?
>>> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
>>> and there will be only drop/pass in both enums.
>>> Also I don't see where these two new SK_MSG_* are used...
>>>
>>>> +
>>>> +/* user accessible metadata for SK_MSG packet hook, new fields must
>>>> + * be added to the end of this structure
>>>> + */
>>>> +struct sk_msg_md {
>>>> +  __u32 data;
>>>> +  __u32 data_end;
>>>> +};
>>>
>>> I think it's time for me to ask for forgiveness :)
>>
>> :-)
>>
>>> I used __u32 for data and data_end only because all other fields
>>> in __sk_buff were __u32 at the time and I couldn't easily figure out
>>> how to teach verifier to recognize 8-byte rewrites.
>>> Unfortunately my mistake stuck and was copied over into xdp.
>>> Since this is new struct let's do it right and add
>>> 'void *data, *data_end' here,
>>> since bpf prog will use them as 'void *' pointers.
>>> There are no compat issues here, since bpf is always 64-bit.
>>
>> But at least offset-wise when you do the ctx rewrite this would then
>> be a bit more tricky when you have 64 bit kernel with 32 bit user
>> space since void * members are in each cases at different offset. So
>> unless I'm missing something, this still should either be __u32 or
>> __u64 instead of void *, no?
> 
> there is no 32-bit user space. these structs are seen by bpf progs only
> and bpf is 64-bit only too.
> unless I'm missing your point.

Ok, so lets say you have 32 bit LLVM binary and compile the prog where
you access md->data_end. Given the void * in the struct will that access
end up being BPF_W at ctx offset 4 or BPF_DW at ctx offset 8 from clang
perspective (iow, is the back end treating this special and always use
fixed BPF_DW in such case)? If not and it would be the first case with
offset 4, then we could have the case that underlying 64 bit kernel is
expecting ctx offset 8 for doing the md ctx conversion.


Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread Alexei Starovoitov
On Thu, Mar 15, 2018 at 11:17:12PM +0100, Daniel Borkmann wrote:
> On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
> > On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
> >>  
> >> +/* User return codes for SK_MSG prog type. */
> >> +enum sk_msg_action {
> >> +  SK_MSG_DROP = 0,
> >> +  SK_MSG_PASS,
> >> +};
> > 
> > do we really need new enum here?
> > It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
> > and there will be only drop/pass in both enums.
> > Also I don't see where these two new SK_MSG_* are used...
> > 
> >> +
> >> +/* user accessible metadata for SK_MSG packet hook, new fields must
> >> + * be added to the end of this structure
> >> + */
> >> +struct sk_msg_md {
> >> +  __u32 data;
> >> +  __u32 data_end;
> >> +};
> > 
> > I think it's time for me to ask for forgiveness :)
> 
> :-)
> 
> > I used __u32 for data and data_end only because all other fields
> > in __sk_buff were __u32 at the time and I couldn't easily figure out
> > how to teach verifier to recognize 8-byte rewrites.
> > Unfortunately my mistake stuck and was copied over into xdp.
> > Since this is new struct let's do it right and add
> > 'void *data, *data_end' here,
> > since bpf prog will use them as 'void *' pointers.
> > There are no compat issues here, since bpf is always 64-bit.
> 
> But at least offset-wise when you do the ctx rewrite this would then
> be a bit more tricky when you have 64 bit kernel with 32 bit user
> space since void * members are in each cases at different offset. So
> unless I'm missing something, this still should either be __u32 or
> __u64 instead of void *, no?

there is no 32-bit user space. these structs are seen by bpf progs only
and bpf is 64-bit only too.
unless I'm missing your point.



Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread Daniel Borkmann
On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
>>  
>> +/* User return codes for SK_MSG prog type. */
>> +enum sk_msg_action {
>> +SK_MSG_DROP = 0,
>> +SK_MSG_PASS,
>> +};
> 
> do we really need new enum here?
> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
> and there will be only drop/pass in both enums.
> Also I don't see where these two new SK_MSG_* are used...
> 
>> +
>> +/* user accessible metadata for SK_MSG packet hook, new fields must
>> + * be added to the end of this structure
>> + */
>> +struct sk_msg_md {
>> +__u32 data;
>> +__u32 data_end;
>> +};
> 
> I think it's time for me to ask for forgiveness :)

:-)

> I used __u32 for data and data_end only because all other fields
> in __sk_buff were __u32 at the time and I couldn't easily figure out
> how to teach verifier to recognize 8-byte rewrites.
> Unfortunately my mistake stuck and was copied over into xdp.
> Since this is new struct let's do it right and add
> 'void *data, *data_end' here,
> since bpf prog will use them as 'void *' pointers.
> There are no compat issues here, since bpf is always 64-bit.

But at least offset-wise when you do the ctx rewrite this would then
be a bit more tricky when you have 64 bit kernel with 32 bit user
space since void * members are in each cases at different offset. So
unless I'm missing something, this still should either be __u32 or
__u64 instead of void *, no?

>> +static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
>> +{
>> +return ((_rc == SK_PASS) ?
>> +   (md->map ? __SK_REDIRECT : __SK_PASS) :
>> +   __SK_DROP);
> 
> you're using old SK_PASS here too ;)
> that's to my point of not adding SK_MSG_PASS...
> 
> Overall the patch set looks absolutely great.
> Thank you for working on it.

+1


Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread John Fastabend
On 03/15/2018 02:59 PM, Alexei Starovoitov wrote:
> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
>>  
>> +/* User return codes for SK_MSG prog type. */
>> +enum sk_msg_action {
>> +SK_MSG_DROP = 0,
>> +SK_MSG_PASS,
>> +};
> 
> do we really need new enum here?

Nope and as you noticed the actual code uses the
SK_{DROP|PASS} enum. Will remove this.

> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
> and there will be only drop/pass in both enums.
> Also I don't see where these two new SK_MSG_* are used...
> 
>> +
>> +/* user accessible metadata for SK_MSG packet hook, new fields must
>> + * be added to the end of this structure
>> + */
>> +struct sk_msg_md {
>> +__u32 data;
>> +__u32 data_end;
>> +};
> 
> I think it's time for me to ask for forgiveness :)
> I used __u32 for data and data_end only because all other fields
> in __sk_buff were __u32 at the time and I couldn't easily figure out
> how to teach verifier to recognize 8-byte rewrites.
> Unfortunately my mistake stuck and was copied over into xdp.
> Since this is new struct let's do it right and add
> 'void *data, *data_end' here,
> since bpf prog will use them as 'void *' pointers.
> There are no compat issues here, since bpf is always 64-bit.
> 

Aha, nice catch. Yep, let's use 'void *' here. I had forgotten about
that discussion and copied them here as well.

>> +static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
>> +{
>> +return ((_rc == SK_PASS) ?
>> +   (md->map ? __SK_REDIRECT : __SK_PASS) :
>> +   __SK_DROP);
> 
> you're using old SK_PASS here too ;)
> that's to my point of not adding SK_MSG_PASS...
> 

+1

> Overall the patch set looks absolutely great.
> Thank you for working on it.
> 

I'll fixup a few of these small things now and should have
a v3 shortly.


Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread Alexei Starovoitov
On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
>  
> +/* User return codes for SK_MSG prog type. */
> +enum sk_msg_action {
> + SK_MSG_DROP = 0,
> + SK_MSG_PASS,
> +};

do we really need new enum here?
It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
and there will be only drop/pass in both enums.
Also I don't see where these two new SK_MSG_* are used...

> +
> +/* user accessible metadata for SK_MSG packet hook, new fields must
> + * be added to the end of this structure
> + */
> +struct sk_msg_md {
> + __u32 data;
> + __u32 data_end;
> +};

I think it's time for me to ask for forgiveness :)
I used __u32 for data and data_end only because all other fields
in __sk_buff were __u32 at the time and I couldn't easily figure out
how to teach verifier to recognize 8-byte rewrites.
Unfortunately my mistake stuck and was copied over into xdp.
Since this is new struct let's do it right and add
'void *data, *data_end' here,
since bpf prog will use them as 'void *' pointers.
There are no compat issues here, since bpf is always 64-bit.

> +static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
> +{
> + return ((_rc == SK_PASS) ?
> +(md->map ? __SK_REDIRECT : __SK_PASS) :
> +__SK_DROP);

you're using old SK_PASS here too ;)
that's to my point of not adding SK_MSG_PASS...

Overall the patch set looks absolutely great.
Thank you for working on it.



Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread David Miller
From: John Fastabend 
Date: Mon, 12 Mar 2018 12:23:29 -0700

> This implements a BPF ULP layer to allow policy enforcement and
> monitoring at the socket layer. In order to support this a new
> program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
> the sendmsg/sendpage hook. To attach the policy to sockets a
> sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
 ...
> Signed-off-by: John Fastabend 

Acked-by: David S. Miller 


[bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-12 Thread John Fastabend
This implements a BPF ULP layer to allow policy enforcement and
monitoring at the socket layer. In order to support this a new
program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
the sendmsg/sendpage hook. To attach the policy to sockets a
sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.

Similar to previous sockmap usages when a sock is added to a
sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
program type attached then the BPF ULP layer is created on the
socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
every msg in sendmsg case and page/offset in sendpage case.

BPF_PROG_TYPE_SK_MSG Semantics/API:

BPF_PROG_TYPE_SK_MSG supports only two return codes, SK_PASS and
SK_DROP. Returning SK_DROP frees the copied data in the sendmsg
case and in the sendpage case leaves the data untouched. Both cases
return -EACCES to the user. Returning SK_PASS will allow the msg to
be sent.
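
As a rough illustration of what the caller sees, the mapping from program
return code to syscall result could be sketched as below. This is not code
from the patch: the enum and function names are made up, and returning the
byte count on pass is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>

/* Stand-ins for the SK_DROP/SK_PASS return codes; names are illustrative. */
enum sketch_sk_action {
	SKETCH_SK_DROP = 0,
	SKETCH_SK_PASS,
};

/*
 * What the caller of sendmsg/sendpage would see for a given program
 * return code. Returning 'bytes' on pass is an assumption of this
 * sketch; anything other than pass is treated as a drop.
 */
long sketch_send_result(int rc, long bytes)
{
	assert(bytes >= 0);
	return rc == SKETCH_SK_PASS ? bytes : -EACCES;
}
```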

In the sendmsg case data is copied into kernel space buffers before
running the BPF program. The kernel space buffers are stored in a
scatterlist object where each element is a kernel memory buffer.
Some effort is made to coalesce data from the sendmsg call here.
For example a sendmsg call with many one byte iov entries will
likely be pushed into a single entry. The BPF program is run with
data pointers (start/end) pointing to the first sg element.
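
The coalescing idea can be sketched in user space as follows. This is only a
model, not the kernel code: struct sketch_sg, sketch_sg_copy() and both size
constants are hypothetical, and plain arrays stand in for the scatterlist.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

#define SKETCH_ELEM_SIZE 4096	/* assumed size of one buffer */
#define SKETCH_MAX_ELEMS 8	/* assumed cap on the number of buffers */

/* Toy model of a scatterlist: fixed buffers plus a fill level for each. */
struct sketch_sg {
	unsigned char buf[SKETCH_MAX_ELEMS][SKETCH_ELEM_SIZE];
	size_t len[SKETCH_MAX_ELEMS];
	size_t nelems;
};

/*
 * Copy every iov entry into sg, packing the bytes back to back so that
 * many small entries share one element. Returns 0 on success and -1 if
 * the data does not fit.
 */
int sketch_sg_copy(struct sketch_sg *sg, const struct iovec *iov, int iovcnt)
{
	assert(sg != NULL && (iov != NULL || iovcnt == 0));
	sg->nelems = 0;

	for (int i = 0; i < iovcnt; i++) {
		const unsigned char *src = iov[i].iov_base;
		size_t left = iov[i].iov_len;

		while (left > 0) {
			/* Open a new element only when the current one is full. */
			if (sg->nelems == 0 ||
			    sg->len[sg->nelems - 1] == SKETCH_ELEM_SIZE) {
				if (sg->nelems == SKETCH_MAX_ELEMS)
					return -1;
				sg->len[sg->nelems++] = 0;
			}

			size_t cur = sg->nelems - 1;
			size_t room = SKETCH_ELEM_SIZE - sg->len[cur];
			size_t n = left < room ? left : room;

			memcpy(sg->buf[cur] + sg->len[cur], src, n);
			sg->len[cur] += n;
			src += n;
			left -= n;
		}
	}
	return 0;
}
```

With three one byte iov entries this ends up with a single element of
length 3, which is the case the program's start/end pointers would cover.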

In the sendpage case data is not copied. We opt not to copy the
data by default here, because the BPF infrastructure does not
know what bytes will be needed nor when they will be needed. So
copying all bytes may be wasteful. Because of this the initial
start/end data pointers are (0,0). Meaning no data can be read or
written. This avoids reading data that may be modified by the
user. A new helper is added later in this series if reading and
writing the data is needed. The helper call will do a copy by
default so that the page is exclusively owned by the BPF call.
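
To show why (0,0) start/end pointers mean no access, here is a small sketch
of the usual bounds check against data_end. The struct and function names are
hypothetical, and integers are used for the two pointers to keep the
arithmetic simple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the start/end values handed to the program. */
struct sketch_md {
	uintptr_t data;
	uintptr_t data_end;
};

/* True if 'len' bytes starting at offset 'off' lie within [data, data_end). */
bool sketch_can_access(const struct sketch_md *md, size_t off, size_t len)
{
	assert(md != NULL && md->data <= md->data_end);

	size_t avail = (size_t)(md->data_end - md->data);

	return off <= avail && len <= avail - off;
}
```

With data == data_end == 0 the available length is zero, so any read or
write of one byte or more fails the check.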

The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
in the sendmsg() case and the entire page/offset in the sendpage case.
This avoids ambiguity on how to handle mixed return codes in the
sendmsg case. Again a helper is added later in the series if
a verdict needs to apply to multiple system calls and/or only
a subpart of the currently being processed message.

The helper msg_redirect_map() can be used to select the socket to
send the data on. It is used in the same way as the existing redirect
use cases and allows policy to redirect msgs.
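
A sketch of how the program return code and a selected redirect target might
combine into one internal verdict, modeled loosely on bpf_map_msg_verdict()
in this patch. The names are illustrative, and the 'redirect_selected' flag
stands in for the map pointer that the redirect helper presumably sets.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the user-visible return codes. */
enum sketch_user_rc {
	SKETCH_RC_DROP = 0,
	SKETCH_RC_PASS,
};

/* Stand-ins for the internal verdicts. */
enum sketch_verdict {
	SKETCH_V_PASS = 0,
	SKETCH_V_DROP,
	SKETCH_V_REDIRECT,
};

static_assert(SKETCH_RC_DROP == 0, "drop is 0, as in the uapi enum");

/*
 * A pass turns into a redirect when a target socket was selected;
 * anything other than pass is a drop.
 */
enum sketch_verdict sketch_msg_verdict(int rc, bool redirect_selected)
{
	if (rc != SKETCH_RC_PASS)
		return SKETCH_V_DROP;
	return redirect_selected ? SKETCH_V_REDIRECT : SKETCH_V_PASS;
}
```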

Pseudo code simple example:

The basic logic to attach a program to a socket is as follows,

  // load the programs
  bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
                &obj, &msg_prog);

  // lookup the sockmap
  bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");

  // get fd for sockmap
  map_fd_msg = bpf_map__fd(bpf_map_msg);

  // attach program to sockmap
  bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);

Adding sockets to the map is done in the normal way,

  // Add a socket 'fd' to sockmap at location 'i'
  bpf_map_update_elem(map_fd_msg, &i, &fd, BPF_ANY);

After the above any socket attached to "my_sock_map", in this case
'fd', will run the BPF msg verdict program (msg_prog) on every
sendmsg and sendpage system call.

For a complete example see BPF selftests or sockmap samples.

Implementation notes:

It seemed the simplest, to me at least, to use a refcnt to ensure
psock is not lost across the sendmsg copy into the sg, the bpf program
running on the data in sg_data, and the final pass to the TCP stack.
Some performance testing may show a better method to do this and avoid
the refcnt cost, but for now use the simpler method.
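
The refcnt pattern, reduced to a user-space sketch with C11 atomics. This is
not the kernel implementation: struct sketch_psock and its helpers are
hypothetical, and a 'released' flag stands in for actually freeing the
object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy psock: a reference count plus a flag set when the last ref is gone. */
struct sketch_psock {
	atomic_int refcnt;
	bool released;
};

/* Start with one reference, held by whoever created the object. */
void sketch_psock_init(struct sketch_psock *p)
{
	atomic_init(&p->refcnt, 1);
	p->released = false;
}

/* Take a reference for the duration of one send; fails once the count hit 0. */
bool sketch_psock_get(struct sketch_psock *p)
{
	int old = atomic_load(&p->refcnt);

	while (old > 0) {
		/* On failure 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&p->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the last put is where real code would free the psock. */
void sketch_psock_put(struct sketch_psock *p)
{
	int old = atomic_fetch_sub(&p->refcnt, 1);

	assert(old > 0);
	if (old == 1)
		p->released = true;
}
```

A send path would call sketch_psock_get() before the copy into the sg and
sketch_psock_put() after the data has been handed to the TCP stack, so the
object cannot go away in between.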

Another item that will come after basic support is in place is
supporting MSG_MORE flag. At the moment we call sendpages even if
the MSG_MORE flag is set. An enhancement would be to collect the
pages into a larger scatterlist and pass down the stack. Notice that
bpf_tcp_sendmsg() could support this with some additional state saved
across sendmsg calls. I built the code to support this without having
to do refactoring work. Other features TBD include ZEROCOPY and the
TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
shortly.

Future work could improve size limits on the scatterlist rings used
here. Currently, we use MAX_SKB_FRAGS simply because this was being
used already in the TLS case. Future work could extend the kernel sk
APIs to tune this depending on workload. This is a trade-off
between memory usage and throughput performance.

Signed-off-by: John Fastabend 
---
 include/linux/bpf.h   |1 
 include/linux/bpf_types.h |1 
 include/linux/filter.h|   17 +
 include/uapi/linux/bpf.h  |   28 ++
 kernel/bpf/sockmap.c  |  714 -
 kernel/bpf/syscall.c  |   14 +
 kernel/bpf/verifier.c |5 
 net/core/filter.c |  106 +++
 8 files