Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-15 Thread महेश बंडेवार
On Wed, Mar 14, 2018 at 8:37 PM, Alexei Starovoitov
 wrote:
> On Wed, Mar 14, 2018 at 05:17:54PM -0700, Eric Dumazet wrote:
>>
>>
>> On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
>> > On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
>> >  wrote:
>> >>
>> >>> It seems this is exactly the case where a netns would be the correct answer.
>> >>
>> >> Unfortunately that's not the case. That's what I tried to explain
>> >> in the cover letter:
>> >> "The setup involves per-container IPs, policy, etc, so traditional
>> >> network-only solutions that involve VRFs, netns, acls are not applicable."
>> >> To elaborate more on that:
>> >> netns is l2 isolation.
>> >> vrf is l3 isolation.
>> >> whereas to containerize an application we need to punch connectivity holes
>> >> in these layered techniques.
>> >> We also considered resurrecting Hannes's afnetns work
>> >> and even went as far as designing a new namespace for L4 isolation.
>> >> Unfortunately all hierarchical namespace abstraction don't work.
>> >> To run an application inside cgroup container that was not written
>> >> with containers in mind we have to make an illusion of running
>> >> in non-containerized environment.
>> >> In some cases we remember the port and container id in the post-bind hook
>> >> in a bpf map and when some other task in a different container is trying
>> >> to connect to a service we need to know where this service is running.
>> >> It can be remote and can be local. Both client and service may or may not
>> >> be written with containers in mind and this sockaddr rewrite is providing
>> >> connectivity and load balancing feature that you simply cannot do
>> >> with hierarchical networking primitives.
>> >
>> > have to explain this a bit further...
>> > We also considered hacking these 'connectivity holes' in
>> > netns and/or vrf, but that would be real layering violation,
>> > since clean l2, l3 abstraction would suddenly support
>> > something that breaks through the layers.
>> > Just like many consider ipvlan a bad hack that punches
>> > through the layers and connects l2 abstraction of netns
>> > at l3 layer, this is not something kernel should ever do.
>> > We really didn't want another ipvlan-like hack in the kernel.
>> > Instead bpf programs at bind/connect time _help_
>> > applications discover and connect to each other.
>> > All containers are running in init_netns and there are no vrfs.
>> > After bind/connect the normal fib/neighbor core networking
>> > logic works as it should always do. The whole system is
>> > clean from network point of view.
>>
>>
>> We apparently missed something when deploying ipvlan and one netns per
>> container/job
>
> Hannes expressed the reasons why RHEL doesn't support ipvlan long ago.

I had a long discussion with Hannes and there are two pending issues
(discounting minor bug fixes / improvements): (a) multicast group
membership and (b) early demux.
Multicast group membership is just a matter of putting some code there
to fix it, while early demux is a little harder to fix without violating
isolation boundaries. To me isolation is critical and if
we find the right solution that doesn't violate isolation, we'd fix it.

> I couldn't find the complete link. This one mentions some of the issues:
> https://www.mail-archive.com/netdev@vger.kernel.org/msg157614.html
> Since ipvlan works for you, great, but it's clearly a layering violation.
> ipvlan connects L2 namespaces via L3 by doing its own fib lookups.
> To me it's the definition of 'punching a connectivity hole' in the L2 abstraction.
> In normal L2 setup of netns+veth the traffic from one netns should
> have gone into another netns via full L2. ipvlan cheats by giving
> L3 connectivity. It's not clean to me.

IPvlan supports three different modes and you have mixed all of them
while explaining your understanding of IPvlan. Probably one needs to
digest all these modes and evaluate them in the context of their use
case. Well, I'm not even going to attempt to explain the differences,
if you were serious you could have figured it out.

There are lots of use cases and people use it in interesting ways.
Each case can be better handled by using either VRF, or macvlan, or
IPvlan or whatever is out there. It would be childish to say one
use-case is better than others as these are *different* use cases. All
these solutions come with their own caveats and you choose what you
can live with. Well, you can always improve and I can see Red Hat folks
are doing it and I appreciate their efforts.

Like I said there are several different ways to make this work with
namespaces in a much cleaner way and IPvlan does not need to be
involved. However adding another eBPF hook in a hackish way just
because we can is *not* the right way. Especially when a problem has
already been solved (with namespaces) these 2000 lines don't deserve to
be in the kernel. eBPF is a good tool and there is a thin line between
using it appropriately and misusing it. I don't want to argue

Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-15 Thread Jiri Benc
On Wed, 14 Mar 2018 20:37:00 -0700, Alexei Starovoitov wrote:
> Hannes expressed the reasons why RHEL doesn't support ipvlan long ago.
> I couldn't find the complete link. This one mentions some of the issues:
> https://www.mail-archive.com/netdev@vger.kernel.org/msg157614.html

ipvlan improved a lot since that time :-) And Paolo has recently fixed
the remaining issues we were aware of.

 Jiri


Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-14 Thread Alexei Starovoitov
On Wed, Mar 14, 2018 at 05:17:54PM -0700, Eric Dumazet wrote:
> 
> 
> On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
> > On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
> >  wrote:
> >>
> >>> It seems this is exactly the case where a netns would be the correct answer.
> >>
> >> Unfortunately that's not the case. That's what I tried to explain
> >> in the cover letter:
> >> "The setup involves per-container IPs, policy, etc, so traditional
> >> network-only solutions that involve VRFs, netns, acls are not applicable."
> >> To elaborate more on that:
> >> netns is l2 isolation.
> >> vrf is l3 isolation.
> >> whereas to containerize an application we need to punch connectivity holes
> >> in these layered techniques.
> >> We also considered resurrecting Hannes's afnetns work
> >> and even went as far as designing a new namespace for L4 isolation.
> >> Unfortunately all hierarchical namespace abstraction don't work.
> >> To run an application inside cgroup container that was not written
> >> with containers in mind we have to make an illusion of running
> >> in non-containerized environment.
> >> In some cases we remember the port and container id in the post-bind hook
> >> in a bpf map and when some other task in a different container is trying
> >> to connect to a service we need to know where this service is running.
> >> It can be remote and can be local. Both client and service may or may not
> >> be written with containers in mind and this sockaddr rewrite is providing
> >> connectivity and load balancing feature that you simply cannot do
> >> with hierarchical networking primitives.
> > 
> > have to explain this a bit further...
> > We also considered hacking these 'connectivity holes' in
> > netns and/or vrf, but that would be real layering violation,
> > since clean l2, l3 abstraction would suddenly support
> > something that breaks through the layers.
> > Just like many consider ipvlan a bad hack that punches
> > through the layers and connects l2 abstraction of netns
> > at l3 layer, this is not something kernel should ever do.
> > We really didn't want another ipvlan-like hack in the kernel.
> > Instead bpf programs at bind/connect time _help_
> > applications discover and connect to each other.
> > All containers are running in init_netns and there are no vrfs.
> > After bind/connect the normal fib/neighbor core networking
> > logic works as it should always do. The whole system is
> > clean from network point of view.
> 
> 
> We apparently missed something when deploying ipvlan and one netns per
> container/job

Hannes expressed the reasons why RHEL doesn't support ipvlan long ago.
I couldn't find the complete link. This one mentions some of the issues:
https://www.mail-archive.com/netdev@vger.kernel.org/msg157614.html
Since ipvlan works for you, great, but it's clearly a layering violation.
ipvlan connects L2 namespaces via L3 by doing its own fib lookups.
To me it's the definition of 'punching a connectivity hole' in the L2 abstraction.
In normal L2 setup of netns+veth the traffic from one netns should
have gone into another netns via full L2. ipvlan cheats by giving
L3 connectivity. It's not clean to me. There are still neighbour
tables in netnses that are duplicated.
Because netns is L2 there is full requeuing for traffic across netnses.
I guess Google doesn't prioritize container to container traffic
while outside into netns via ipvlan works ok similar to bond, but
imo it's cheating too.
imo afnetns would have been a much better alternative for your
use case without ipvlan pitfalls, but as you said ipvlan is already
in the tree and afnetns is not.
With afnetns early demux would have worked not only for traffic from
the network, but for traffic across afnetns-es.

> I find netns isolation very clean, powerful, and it is there already.

netns+veth is a clean abstraction, but netns+ipvlan is imo not.
imo VRF is another clean L3 abstraction. Yet some folks tried
to do VRF-like things with netns.
David Ahern wrote a nice blog post about issues with that.
I suspect VRF also could have worked for the Google use case
and would have been easier to use than netns+ipvlan.
But since ipvlan works for you in the current shape, great,
I'm not going to argue further.
Let's agree to disagree on cleanliness of the solution.

> It also works with UDP just fine. Are you considering adding a hook
> later for sendmsg() (unconnected socket or not), or do you want to use
> the existing one in ip_finish_output(), adding per-packet overhead ?

Currently that's indeed the case. Existing cgroup-bpf hooks
at ip_finish_output work for many use cases, but per-packet overhead
is bad. With bind/connect hooks we avoid that overhead for
good traffic (which is tcp and connected udp). We still need
to solve it for unconnected udp. Rough idea is to do similar
sockaddr rewrite/drop in unconnected part of udp_sendmsg.



Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-14 Thread Alexei Starovoitov

On 3/14/18 4:27 PM, Daniel Borkmann wrote:
> On 03/14/2018 07:11 PM, Alexei Starovoitov wrote:
>> On Wed, Mar 14, 2018 at 03:37:01PM +0100, Daniel Borkmann wrote:
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -133,6 +133,8 @@ enum bpf_prog_type {
>>>> BPF_PROG_TYPE_SOCK_OPS,
>>>> BPF_PROG_TYPE_SK_SKB,
>>>> BPF_PROG_TYPE_CGROUP_DEVICE,
>>>> +   BPF_PROG_TYPE_CGROUP_INET4_BIND,
>>>> +   BPF_PROG_TYPE_CGROUP_INET6_BIND,
>>>
>>> Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
>>> confused by the many sock_*/sk_* prog types we have. The attach hook could
>>> still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
>>> storing some prog-type specific void *private_data in prog's aux during
>>> verification could be a way (similarly as you mention) which can later be
>>> retrieved at attach time to reject with -ENOTSUPP or such.
>>
>> that's exactly what I mentioned in the cover letter,
>> but we need to extend attach cmd with verifier-like log_buf+log_size.
>> since simple enotsupp will be impossible to debug.
>
> Hmm, adding verifier-like log_buf + log_size feels a bit of a kludge just
> for this case, but it's the usual problem where getting a std error code
> is like looking into a crystal ball to figure where it's coming from. I'd see
> only couple of other alternatives: distinct error code like ENAVAIL for such
> mismatch, or telling the verifier upfront where this is going to be attached
> to - same as we do with the dev for offloading or as you did with splitting
> the prog types or some sort of notion of sub-prog types; or having a query
> API that returns possible/compatible attach types for this loaded prog via
> e.g. bpf_prog_get_info_by_fd() that loader can precheck or check when error
> occurs. All nothing really nice, though.

query after loading isn't great, since possible attach combinations
will be too high for human to understand,
but I like "passing attach_type into prog_load" idea.
That should work and it fits existing prog_ifindex too.
So we'll add '__u32 attach_type' to prog_load cmd.
elf loader would still need to parse section name to
figure out prog type and attach type.
Something like:
SEC("sock_addr/bind_v4") my_prog(struct bpf_sock_addr *ctx)
SEC("sock_addr/connect_v6") my_prog(struct bpf_sock_addr *ctx)
We still need new prog type for bind_v4/bind_v6/connect_v4/connect_v6
hooks with distinct 'struct bpf_sock_addr' context,
since the prog is accessing both sockaddr and sock.
Adding user_ip4, user_ip6 fields to 'struct bpf_sock_ops'
is doable, but it would be too confusing to users, so imo that's
not a good option.

For post-bind hook we probably can reuse 'struct bpf_sock_ops'
and BPF_PROG_TYPE_SOCK_OPS, since there only sock is the context.
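
A minimal sketch of the section-name idea above, in case it helps the
discussion: the elf loader maps a section name such as
"sock_addr/bind_v4" to a (prog type, attach type) pair that it would then
pass to prog_load. Every name here (the sketch_* enums, the table and
sketch_parse_section()) is a hypothetical stand-in and not the uapi that
was eventually agreed on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins: the real uapi enum names and values were
 * still under discussion in this thread.
 */
enum sketch_prog_type {
	SKETCH_PROG_TYPE_UNSPEC,
	SKETCH_PROG_TYPE_SOCK_ADDR,
};

enum sketch_attach_type {
	SKETCH_ATTACH_UNSPEC,
	SKETCH_ATTACH_BIND_V4,
	SKETCH_ATTACH_BIND_V6,
	SKETCH_ATTACH_CONNECT_V4,
	SKETCH_ATTACH_CONNECT_V6,
};

struct sketch_sec_def {
	const char *sec;
	enum sketch_prog_type prog_type;
	enum sketch_attach_type attach_type;
};

/* One entry per hook named in the mail above. */
static const struct sketch_sec_def sketch_sec_defs[] = {
	{ "sock_addr/bind_v4", SKETCH_PROG_TYPE_SOCK_ADDR, SKETCH_ATTACH_BIND_V4 },
	{ "sock_addr/bind_v6", SKETCH_PROG_TYPE_SOCK_ADDR, SKETCH_ATTACH_BIND_V6 },
	{ "sock_addr/connect_v4", SKETCH_PROG_TYPE_SOCK_ADDR, SKETCH_ATTACH_CONNECT_V4 },
	{ "sock_addr/connect_v6", SKETCH_PROG_TYPE_SOCK_ADDR, SKETCH_ATTACH_CONNECT_V6 },
};

/* Returns 0 and fills both outputs on an exact section-name match,
 * -1 (outputs untouched) otherwise.
 */
int sketch_parse_section(const char *sec,
			 enum sketch_prog_type *prog_type,
			 enum sketch_attach_type *attach_type)
{
	size_t i;

	assert(sec && prog_type && attach_type);
	for (i = 0; i < sizeof(sketch_sec_defs) / sizeof(sketch_sec_defs[0]); i++) {
		if (strcmp(sec, sketch_sec_defs[i].sec) == 0) {
			*prog_type = sketch_sec_defs[i].prog_type;
			*attach_type = sketch_sec_defs[i].attach_type;
			return 0;
		}
	}
	return -1;
}
```

With something like this the verifier would know the attach type at load
time, so a mismatch could be rejected there with the normal verifier log
instead of at attach time.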


> Making verifier-like log_buf + log_size generic meaning not just for the case
> of BPF_PROG_ATTACH specifically might be yet another option, meaning you'd
> have a new BPF_GET_ERROR command to pick up the log for the last failed bpf(2)
> cmd, but either that might require adding a BPF related pointer to task struct
> for this or any other future BPF feature (maybe not really an option); or to
> have some sort of bpf cmd to config and obtain an fd for error queue/log once,
> where this can then be retrieved from and used for all cmds generically.


I don't think we want to hold on to error logs in the kernel,
since user may not query it right away or ever.
verifier log is freed right after prog_load cmd is done.


> Seems like it would potentially be on top of that, plus having an option to
> attach from within the app instead of orchestrator.


right. let's worry about it as potential next step.





Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-14 Thread Eric Dumazet


On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
> On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
>  wrote:
>>
>>> It seems this is exactly the case where a netns would be the correct answer.
>>
>> Unfortunately that's not the case. That's what I tried to explain
>> in the cover letter:
>> "The setup involves per-container IPs, policy, etc, so traditional
>> network-only solutions that involve VRFs, netns, acls are not applicable."
>> To elaborate more on that:
>> netns is l2 isolation.
>> vrf is l3 isolation.
>> whereas to containerize an application we need to punch connectivity holes
>> in these layered techniques.
>> We also considered resurrecting Hannes's afnetns work
>> and even went as far as designing a new namespace for L4 isolation.
>> Unfortunately all hierarchical namespace abstraction don't work.
>> To run an application inside cgroup container that was not written
>> with containers in mind we have to make an illusion of running
>> in non-containerized environment.
>> In some cases we remember the port and container id in the post-bind hook
>> in a bpf map and when some other task in a different container is trying
>> to connect to a service we need to know where this service is running.
>> It can be remote and can be local. Both client and service may or may not
>> be written with containers in mind and this sockaddr rewrite is providing
>> connectivity and load balancing feature that you simply cannot do
>> with hierarchical networking primitives.
> 
> have to explain this a bit further...
> We also considered hacking these 'connectivity holes' in
> netns and/or vrf, but that would be real layering violation,
> since clean l2, l3 abstraction would suddenly support
> something that breaks through the layers.
> Just like many consider ipvlan a bad hack that punches
> through the layers and connects l2 abstraction of netns
> at l3 layer, this is not something kernel should ever do.
> We really didn't want another ipvlan-like hack in the kernel.
> Instead bpf programs at bind/connect time _help_
> applications discover and connect to each other.
> All containers are running in init_netns and there are no vrfs.
> After bind/connect the normal fib/neighbor core networking
> logic works as it should always do. The whole system is
> clean from network point of view.


We apparently missed something when deploying ipvlan and one netns per
container/job

Full access to 64K ports, no more ports being reserved/abused.
If one job needs more, no problem, just use more than one IP per netns.

It also works with UDP just fine. Are you considering adding a hook
later for sendmsg() (unconnected socket or not), or do you want to use
the existing one in ip_finish_output(), adding per-packet overhead ?

This notion of 'clean l2, l3 abstraction' is very subjective.
I find netns isolation very clean, powerful, and it is there already.

eBPF is certainly nice, but pretending netns/ipvlan are hacks is not
credible.



Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-14 Thread Daniel Borkmann
On 03/14/2018 07:11 PM, Alexei Starovoitov wrote:
> On Wed, Mar 14, 2018 at 03:37:01PM +0100, Daniel Borkmann wrote:
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -133,6 +133,8 @@ enum bpf_prog_type {
>>> BPF_PROG_TYPE_SOCK_OPS,
>>> BPF_PROG_TYPE_SK_SKB,
>>> BPF_PROG_TYPE_CGROUP_DEVICE,
>>> +   BPF_PROG_TYPE_CGROUP_INET4_BIND,
>>> +   BPF_PROG_TYPE_CGROUP_INET6_BIND,
>>
>> Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
>> confused by the many sock_*/sk_* prog types we have. The attach hook could
>> still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
>> storing some prog-type specific void *private_data in prog's aux during
>> verification could be a way (similarly as you mention) which can later be
>> retrieved at attach time to reject with -ENOTSUPP or such.
> 
> that's exactly what I mentioned in the cover letter,
> but we need to extend attach cmd with verifier-like log_buf+log_size.
> since simple enotsupp will be impossible to debug.

Hmm, adding verifier-like log_buf + log_size feels a bit of a kludge just
for this case, but it's the usual problem where getting a std error code
is like looking into a crystal ball to figure where it's coming from. I'd see
only couple of other alternatives: distinct error code like ENAVAIL for such
mismatch, or telling the verifier upfront where this is going to be attached
to - same as we do with the dev for offloading or as you did with splitting
the prog types or some sort of notion of sub-prog types; or having a query
API that returns possible/compatible attach types for this loaded prog via
e.g. bpf_prog_get_info_by_fd() that loader can precheck or check when error
occurs. All nothing really nice, though.

Making verifier-like log_buf + log_size generic meaning not just for the case
of BPF_PROG_ATTACH specifically might be yet another option, meaning you'd
have a new BPF_GET_ERROR command to pick up the log for the last failed bpf(2)
cmd, but either that might require adding a BPF related pointer to task struct
for this or any other future BPF feature (maybe not really an option); or to
have some sort of bpf cmd to config and obtain an fd for error queue/log once,
where this can then be retrieved from and used for all cmds generically.

[...]
>>> +struct bpf_sock_addr {
>>> +   __u32 user_family;  /* Allows 4-byte read, but no write. */
>>> +   __u32 user_ip4; /* Allows 1,2,4-byte read and 4-byte write.
>>> +* Stored in network byte order.
>>> +*/
>>> +   __u32 user_ip6[4];  /* Allows 1,2,4-byte read an 4-byte write.
>>> +* Stored in network byte order.
>>> +*/
>>> +   __u32 user_port;/* Allows 4-byte read and write.
>>> +* Stored in network byte order
>>> +*/
>>> +   __u32 family;   /* Allows 4-byte read, but no write */
>>> +   __u32 type; /* Allows 4-byte read, but no write */
>>> +   __u32 protocol; /* Allows 4-byte read, but no write */
>>
>> I recall bind to prefix came up from time to time in BPF context in the sense
>> to let the app itself be more flexible to bind to BPF prog. Do you see also app
>> to be able to add a BPF prog into the array itself?
> 
> I'm not following. In this case the container management framework
> will attach bpf progs to cgroups and apps inside the cgroups will
> get their bind/connects overwritten when necessary.

Was mostly just thinking whether it could also cover the use case that was
brought up from time to time e.g.:

  https://www.mail-archive.com/netdev@vger.kernel.org/msg100914.html

Seems like it would potentially be on top of that, plus having an option to
attach from within the app instead of orchestrator.


Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-14 Thread Alexei Starovoitov
On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
 wrote:
>
>> It seems this is exactly the case where a netns would be the correct answer.
>
> Unfortunately that's not the case. That's what I tried to explain
> in the cover letter:
> "The setup involves per-container IPs, policy, etc, so traditional
> network-only solutions that involve VRFs, netns, acls are not applicable."
> To elaborate more on that:
> netns is l2 isolation.
> vrf is l3 isolation.
> whereas to containerize an application we need to punch connectivity holes
> in these layered techniques.
> We also considered resurrecting Hannes's afnetns work
> and even went as far as designing a new namespace for L4 isolation.
> Unfortunately all hierarchical namespace abstraction don't work.
> To run an application inside cgroup container that was not written
> with containers in mind we have to make an illusion of running
> in non-containerized environment.
> In some cases we remember the port and container id in the post-bind hook
> in a bpf map and when some other task in a different container is trying
> to connect to a service we need to know where this service is running.
> It can be remote and can be local. Both client and service may or may not
> be written with containers in mind and this sockaddr rewrite is providing
> connectivity and load balancing feature that you simply cannot do
> with hierarchical networking primitives.

have to explain this a bit further...
We also considered hacking these 'connectivity holes' in
netns and/or vrf, but that would be real layering violation,
since clean l2, l3 abstraction would suddenly support
something that breaks through the layers.
Just like many consider ipvlan a bad hack that punches
through the layers and connects l2 abstraction of netns
at l3 layer, this is not something kernel should ever do.
We really didn't want another ipvlan-like hack in the kernel.
Instead bpf programs at bind/connect time _help_
applications discover and connect to each other.
All containers are running in init_netns and there are no vrfs.
After bind/connect the normal fib/neighbor core networking
logic works as it should always do. The whole system is
clean from network point of view.


Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-14 Thread Alexei Starovoitov
On Wed, Mar 14, 2018 at 03:37:01PM +0100, Daniel Borkmann wrote:
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -133,6 +133,8 @@ enum bpf_prog_type {
> > BPF_PROG_TYPE_SOCK_OPS,
> > BPF_PROG_TYPE_SK_SKB,
> > BPF_PROG_TYPE_CGROUP_DEVICE,
> > +   BPF_PROG_TYPE_CGROUP_INET4_BIND,
> > +   BPF_PROG_TYPE_CGROUP_INET6_BIND,
> 
> Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
> confused by the many sock_*/sk_* prog types we have. The attach hook could
> still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
> storing some prog-type specific void *private_data in prog's aux during
> verification could be a way (similarly as you mention) which can later be
> retrieved at attach time to reject with -ENOTSUPP or such.

that's exactly what I mentioned in the cover letter,
but we need to extend attach cmd with verifier-like log_buf+log_size.
since simple enotsupp will be impossible to debug.
That's the main question of the RFC.

> >  };
> >  
> >  enum bpf_attach_type {
> > @@ -143,6 +145,8 @@ enum bpf_attach_type {
> > BPF_SK_SKB_STREAM_PARSER,
> > BPF_SK_SKB_STREAM_VERDICT,
> > BPF_CGROUP_DEVICE,
> > +   BPF_CGROUP_INET4_BIND,
> > +   BPF_CGROUP_INET6_BIND,
> 
> Binding to v4 mapped v6 address would work as well, right? Can't this be
> squashed into one attach type as mentioned?

explained the reasons for this in the cover letter and proposed extension
to attach cmd.

> > +struct bpf_sock_addr {
> > +   __u32 user_family;  /* Allows 4-byte read, but no write. */
> > +   __u32 user_ip4; /* Allows 1,2,4-byte read and 4-byte write.
> > +* Stored in network byte order.
> > +*/
> > +   __u32 user_ip6[4];  /* Allows 1,2,4-byte read an 4-byte write.
> > +* Stored in network byte order.
> > +*/
> > +   __u32 user_port;/* Allows 4-byte read and write.
> > +* Stored in network byte order
> > +*/
> > +   __u32 family;   /* Allows 4-byte read, but no write */
> > +   __u32 type; /* Allows 4-byte read, but no write */
> > +   __u32 protocol; /* Allows 4-byte read, but no write */
> 
> I recall bind to prefix came up from time to time in BPF context in the sense
> to let the app itself be more flexible to bind to BPF prog. Do you see also app
> to be able to add a BPF prog into the array itself?

I'm not following. In this case the container management framework
will attach bpf progs to cgroups and apps inside the cgroups will
get their bind/connects overwritten when necessary.

> > +int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
> > + struct sockaddr *uaddr,
> > + enum bpf_attach_type type)
> > +{
> > +   struct bpf_sock_addr_kern ctx = {
> > +   .sk = sk,
> > +   .uaddr = uaddr,
> > +   };
> > +   struct cgroup *cgrp;
> > +   int ret;
> > +
> > +   /* Check socket family since not all sockets represent network
> > +* endpoint (e.g. AF_UNIX).
> > +*/
> > +   if (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)
> > +   return 0;
> > +
> > +   cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > +   ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
> > +
> > +   return ret == 1 ? 0 : -EPERM;
> 
> Semantics may be a little bit strange, though this would perhaps be at the risk
> of the orchestrator(s) (?). Basically when you run through the prog array, then
> the last attached program in that array has the final /real/ say to which address
> to bind/connect to; all the others decisions can freely be overridden, so this
> is dependent on the order the BPF progs how they were attached. I think we don't
> have this case for context in multi-prog tracing, cgroup/inet (only filtering)
> and cgroup/dev. Although in cgroup/sock_ops for the tcp/BPF hooks progs can already
> override the sock_ops.reply (and sk_txhash which may be less relevant) field from
> the ctx, so whatever one prog is assumed to reply back to the caller, another one
> could override it.

correct. tcp-bpf is in the same boat. When progs override the decision the last
prog in the prog_run_array is effective. Remember that
 * The programs of sub-cgroup are executed first, then programs of
 * this cgroup and then programs of parent cgroup.
so outer cgroup controlled by container management is running last.
If it wants to let children do nested overwrites it can look at the same
sockaddr memory region, see what the children's progs or the children's tasks
did with sockaddr, and make the appropriate decision.
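
To make the ordering concrete, here is a small userspace model of it. It
is only a sketch: struct sketch_addr, the sketch_prog type and
sketch_run_array() are hypothetical stand-ins for the sockaddr context and
for BPF_PROG_RUN_ARRAY, and it assumes that verdicts are combined so that
any prog returning 0 denies the call (reported here as -1 rather than
-EPERM).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the sockaddr seen by every prog in the array.
 * Host byte order is used to keep the sketch short.
 */
struct sketch_addr {
	uint32_t ip4;
	uint16_t port;
};

/* A "prog" returns 1 to allow and 0 to deny, and may rewrite *ctx. */
typedef int (*sketch_prog)(struct sketch_addr *ctx);

/* Sub-cgroup prog: rewrites the port the task asked for. */
int sketch_child_prog(struct sketch_addr *ctx)
{
	ctx->port = 8080;
	return 1;
}

/* Parent cgroup prog: runs later, sees the child's rewrite in the same
 * memory and overrides it.
 */
int sketch_parent_prog(struct sketch_addr *ctx)
{
	if (ctx->port == 8080)
		ctx->port = 9090;
	return 1;
}

/* A prog that always denies. */
int sketch_deny_prog(struct sketch_addr *ctx)
{
	(void)ctx;
	return 0;
}

/* Runs the progs in array order (children first, parent last). The last
 * write to *ctx wins; a single 0 verdict denies the whole call.
 * Returns 0 when allowed, -1 when denied.
 */
int sketch_run_array(const sketch_prog *progs, size_t n,
		     struct sketch_addr *ctx)
{
	int ret = 1;
	size_t i;

	assert(progs && ctx);
	for (i = 0; i < n; i++)
		ret &= progs[i](ctx);
	return ret == 1 ? 0 : -1;
}
```

Running the child before the parent leaves port 9090 in the context,
which is the "outer cgroup has the final say" behaviour described above.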

> Wouldn't it make more sense to just have a single prog instead
> to avoid this override/ordering issue?

I don't think there is any ordering issue, but yes, if parent is paranoid
it can install no-override pr

Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-14 Thread Alexei Starovoitov
On Tue, Mar 13, 2018 at 11:21:08PM -0700, Eric Dumazet wrote:
> 
> If I understand well,  strace(1) will not show the real (after modification
> by eBPF) IP/port ?

correct. Just like it won't show anything after syscall entry, whether
lsm acted, seccomp, etc

> What about selinux and other LSM ?

clearly lsm is not the place to do ip/port enforcement for containers.
lsm in general is missing a post-bind hook and visibility into cgroups.
This patch set is not about policy, but more about connectivity.
That's why sockaddr rewrite is a must have.

> We have now network namespaces for full isolation. Soon ILA will come.

we're already using a form of ila. That's orthogonal to this feature.

> The argument that it is not convenient (or even possible) to change the
> application or using modern isolation is quite strange, considering the

just like in any other datacenter there are thousands of third party
applications that we cannot control, including open source code
written by Google. Would golang switch to use glibc? I very much doubt it.
Statically linked apps also don't work with LD_PRELOAD.

> added burden/complexity/bloat to the kernel.

bloat? that's very odd to hear. bpf is very much anti-bloat technique.
If you were serious with that comment, please argue with tracing folks
who add thousand upon thousand lines of code to the kernel to do
hard coded things while bpf already does all that and more
without any extra kernel code.

> The post hook for sys_bind is clearly a failure of the model, since
> releasing the port might already be too late, another thread might fail to
> get it during a non zero time window.

I suspect commit log wasn't clear. In post-bind hook we don't release
the port, we only fail sys_bind and user space will eventually close
the socket and release the port.
I don't think it's safe to call inet_put_port() here. It is also
racy as you pointed out.

> If you want to provide an alternate port allocation strategy, better provide
> a correct eBPF hook.

right. that's another separate work independent from this feature.
port allocation/free from bpf via helper is also necessary, but
for different use case.

> It seems this is exactly the case where a netns would be the correct answer.

Unfortunately that's not the case. That's what I tried to explain
in the cover letter:
"The setup involves per-container IPs, policy, etc, so traditional
network-only solutions that involve VRFs, netns, acls are not applicable."
To elaborate more on that:
netns is l2 isolation.
vrf is l3 isolation.
whereas to containerize an application we need to punch connectivity holes
in these layered techniques.
We also considered resurrecting Hannes's afnetns work
and even went as far as designing a new namespace for L4 isolation.
Unfortunately all hierarchical namespace abstraction don't work.
To run an application inside cgroup container that was not written
with containers in mind we have to make an illusion of running
in non-containerized environment.
In some cases we remember the port and container id in the post-bind hook
in a bpf map and when some other task in a different container is trying
to connect to a service we need to know where this service is running.
It can be remote and can be local. Both client and service may or may not
be written with containers in mind and this sockaddr rewrite is providing
connectivity and load balancing feature that you simply cannot do
with hierarchical networking primitives.
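
A rough userspace model of that flow, as a sketch only: the array stands
in for the bpf map, sketch_post_bind() for the post-bind hook and
sketch_connect() for the connect-time sockaddr rewrite. All names are
hypothetical, addresses are plain host-order integers, and real progs
would key on more than the port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_SERVICES 16

/* Stand-in for one bpf map entry: which container owns a port. */
struct sketch_service {
	uint16_t port;
	uint32_t container_ip;
	int used;
};

static struct sketch_service sketch_map[SKETCH_MAX_SERVICES];

/* Post-bind model: remember that 'port' is served from 'container_ip'.
 * Updates an existing entry for the port, otherwise takes a free slot.
 * Returns 0 on success, -1 if the table is full.
 */
int sketch_post_bind(uint16_t port, uint32_t container_ip)
{
	size_t i, free_slot = SKETCH_MAX_SERVICES;

	for (i = 0; i < SKETCH_MAX_SERVICES; i++) {
		if (sketch_map[i].used && sketch_map[i].port == port) {
			sketch_map[i].container_ip = container_ip;
			return 0;
		}
		if (!sketch_map[i].used && free_slot == SKETCH_MAX_SERVICES)
			free_slot = i;
	}
	if (free_slot == SKETCH_MAX_SERVICES)
		return -1;
	sketch_map[free_slot].port = port;
	sketch_map[free_slot].container_ip = container_ip;
	sketch_map[free_slot].used = 1;
	return 0;
}

/* Connect model: if the destination port is a known service, rewrite the
 * destination address to the container that bound it.
 * Returns 1 if *dst_ip was rewritten, 0 if it was left alone.
 */
int sketch_connect(uint32_t *dst_ip, uint16_t dst_port)
{
	size_t i;

	assert(dst_ip != NULL);
	for (i = 0; i < SKETCH_MAX_SERVICES; i++) {
		if (sketch_map[i].used && sketch_map[i].port == dst_port) {
			*dst_ip = sketch_map[i].container_ip;
			return 1;
		}
	}
	return 0;
}
```

The client keeps calling connect() with the address it was written to
use; only the sockaddr handed to the stack changes, and routing after
that point is the normal fib/neighbor path.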

btw the per-container policy enforcement of ip+port via these hooks
wasn't our planned feature. It was requested by other folks and
we had to tweak the api a little bit to satisfy both our and their requirements.



Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-14 Thread Daniel Borkmann
On 03/14/2018 03:37 PM, Daniel Borkmann wrote:
> On 03/14/2018 04:39 AM, Alexei Starovoitov wrote:
> [...]
>> +#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr)  \
>> +BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_BIND)
>> +
>> +#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr)  \
>> +BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_BIND)
>> +
>>  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)  \
>>  ({ \
>>  int __ret = 0; \
>> @@ -135,6 +154,8 @@ static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
>>  #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
>>  #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
>>  #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
>> +#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) ({ 0; })
>> +#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) ({ 0; })
>>  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
>>  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
>>  
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index 19b8349a3809..eefd877f8e68 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -8,6 +8,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
>> +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_BIND, cg_inet4_bind)
>> +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_BIND, cg_inet6_bind)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index fdb691b520c0..fe469320feab 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -1001,6 +1001,16 @@ static inline int bpf_tell_extensions(void)
>>  return SKF_AD_MAX;
>>  }
>>  
>> +struct bpf_sock_addr_kern {
>> +struct sock *sk;
>> +struct sockaddr *uaddr;
>> +/* Temporary "register" to make indirect stores to nested structures
>> + * defined above. We need three registers to make such a store, but
>> + * only two (src and dst) are available at convert_ctx_access time
>> + */
>> +u64 tmp_reg;
>> +};
>> +
>>  struct bpf_sock_ops_kern {
>>  struct  sock *sk;
>>  u32 op;
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 2a66769e5875..78628a3f3cd8 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -133,6 +133,8 @@ enum bpf_prog_type {
>>  BPF_PROG_TYPE_SOCK_OPS,
>>  BPF_PROG_TYPE_SK_SKB,
>>  BPF_PROG_TYPE_CGROUP_DEVICE,
>> +BPF_PROG_TYPE_CGROUP_INET4_BIND,
>> +BPF_PROG_TYPE_CGROUP_INET6_BIND,
> 
> Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
> confused by the many sock_*/sk_* prog types we have. The attach hook could
> still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
> storing some prog-type specific void *private_data in prog's aux during
> verification could be a way (similarly as you mention) which can later be
> retrieved at attach time to reject with -ENOTSUPP or such.
> 
>>  };
>>  
>>  enum bpf_attach_type {
>> @@ -143,6 +145,8 @@ enum bpf_attach_type {
>>  BPF_SK_SKB_STREAM_PARSER,
>>  BPF_SK_SKB_STREAM_VERDICT,
>>  BPF_CGROUP_DEVICE,
>> +BPF_CGROUP_INET4_BIND,
>> +BPF_CGROUP_INET6_BIND,
> 
> Binding to v4 mapped v6 address would work as well, right? Can't this be
> squashed into one attach type as mentioned?
> 
>>  __MAX_BPF_ATTACH_TYPE
>>  };
>>  
>> @@ -953,6 +957,26 @@ struct bpf_map_info {
>>  __u64 netns_ino;
>>  } __attribute__((aligned(8)));
>>  
>> +/* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
>> + * by user and intended to be used by socket (e.g. to bind to, depends on
>> + * attach attach type).
>> + */
>> +struct bpf_sock_addr {
>> +__u32 user_family;  /* Allows 4-byte read, but no write. */
>> +__u32 user_ip4; /* Allows 1,2,4-byte read and 4-byte write.
>> + * Stored in network byte order.
>> + */
>> +__u32 user_ip6[4];  /* Allows 1,2,4-byte read an 4-byte write.
>> + * Stored in network byte order.
>> + */
>> +__u32 user_port;/* Allows 4-byte read and write.
>> + * Stored in network byte order
>> + */
>> +__u32 family;   /* Allows 4-byte read, but no write */
>> +__u32 type; /* Allows 4-byte read, but no write */
>> +__u32 protocol; /* Allows 4-byte 

Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-14 Thread Daniel Borkmann
On 03/14/2018 04:39 AM, Alexei Starovoitov wrote:
[...]
> +#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr)   \
> + BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_BIND)
> +
> +#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr)   \
> + BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_BIND)
> +
>  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)   \
>  ({  \
>   int __ret = 0; \
> @@ -135,6 +154,8 @@ static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
>  #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
> +#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) ({ 0; })
> +#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
>  
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 19b8349a3809..eefd877f8e68 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -8,6 +8,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_BIND, cg_inet4_bind)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_BIND, cg_inet6_bind)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index fdb691b520c0..fe469320feab 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1001,6 +1001,16 @@ static inline int bpf_tell_extensions(void)
>   return SKF_AD_MAX;
>  }
>  
> +struct bpf_sock_addr_kern {
> + struct sock *sk;
> + struct sockaddr *uaddr;
> + /* Temporary "register" to make indirect stores to nested structures
> +  * defined above. We need three registers to make such a store, but
> +  * only two (src and dst) are available at convert_ctx_access time
> +  */
> + u64 tmp_reg;
> +};
> +
>  struct bpf_sock_ops_kern {
>   struct  sock *sk;
>   u32 op;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2a66769e5875..78628a3f3cd8 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -133,6 +133,8 @@ enum bpf_prog_type {
>   BPF_PROG_TYPE_SOCK_OPS,
>   BPF_PROG_TYPE_SK_SKB,
>   BPF_PROG_TYPE_CGROUP_DEVICE,
> + BPF_PROG_TYPE_CGROUP_INET4_BIND,
> + BPF_PROG_TYPE_CGROUP_INET6_BIND,

Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
confused by the many sock_*/sk_* prog types we have. The attach hook could
still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
storing some prog-type specific void *private_data in prog's aux during
verification could be a way (similarly as you mention) which can later be
retrieved at attach time to reject with -ENOTSUPP or such.

>  };
>  
>  enum bpf_attach_type {
> @@ -143,6 +145,8 @@ enum bpf_attach_type {
>   BPF_SK_SKB_STREAM_PARSER,
>   BPF_SK_SKB_STREAM_VERDICT,
>   BPF_CGROUP_DEVICE,
> + BPF_CGROUP_INET4_BIND,
> + BPF_CGROUP_INET6_BIND,

Binding to v4 mapped v6 address would work as well, right? Can't this be
squashed into one attach type as mentioned?
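
For reference, a v4-mapped v6 address has the form `::ffff:a.b.c.d`, so a single IPv6-side hook could in principle recover the IPv4 address from it. The sketch below only illustrates that check in userspace; `extract_v4_mapped` is a hypothetical helper name, not something from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>

/* If addr6 is v4-mapped (::ffff:a.b.c.d), copy the embedded IPv4 address
 * (network byte order) to *ip4_be and return 1; otherwise return 0. */
int extract_v4_mapped(const struct in6_addr *addr6, uint32_t *ip4_be)
{
    assert(addr6 != NULL && ip4_be != NULL);
    if (!IN6_IS_ADDR_V4MAPPED(addr6))
        return 0;
    /* The IPv4 address occupies the last four bytes of the IPv6 address. */
    memcpy(ip4_be, &addr6->s6_addr[12], sizeof(*ip4_be));
    return 1;
}
```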

>   __MAX_BPF_ATTACH_TYPE
>  };
>  
> @@ -953,6 +957,26 @@ struct bpf_map_info {
>   __u64 netns_ino;
>  } __attribute__((aligned(8)));
>  
> +/* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
> + * by user and intended to be used by socket (e.g. to bind to, depends on
> + * attach attach type).
> + */
> +struct bpf_sock_addr {
> + __u32 user_family;  /* Allows 4-byte read, but no write. */
> + __u32 user_ip4; /* Allows 1,2,4-byte read and 4-byte write.
> +  * Stored in network byte order.
> +  */
> + __u32 user_ip6[4];  /* Allows 1,2,4-byte read an 4-byte write.
> +  * Stored in network byte order.
> +  */
> + __u32 user_port;/* Allows 4-byte read and write.
> +  * Stored in network byte order
> +  */
> + __u32 family;   /* Allows 4-byte read, but no write */
> + __u32 type; /* Allows 4-byte read, but no write */
> + __u32 protocol; /* Allows 4-byte read, but no write */

I recall bind to prefix came up from time to time in BPF context in the sense
to let the app itself be more f

Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-13 Thread Eric Dumazet



On 03/13/2018 08:39 PM, Alexei Starovoitov wrote:

> From: Andrey Ignatov
>
> == The problem ==
>
> There is a use-case when all processes inside a cgroup should use one
> single IP address on a host that has multiple IPs configured.  Those
> processes should use the IP for both ingress and egress, for TCP and UDP
> traffic. So TCP/UDP servers should be bound to that IP to accept
> incoming connections on it, and TCP/UDP clients should make outgoing
> connections from that IP. It should not require changing application
> code since it's often not possible.
>
> Currently it's solved by intercepting glibc wrappers around syscalls
> such as `bind(2)` and `connect(2)`. It's done by a shared library that
> is preloaded for every process in a cgroup so that whenever a TCP/UDP
> server calls `bind(2)`, the library replaces the IP in sockaddr before
> passing arguments to the syscall. When an application calls `connect(2)` the
> library transparently binds the local end of the connection to that IP
> (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid a performance penalty).
>
> The shared library approach is fragile though, e.g.:
> * some applications clear env vars (incl. `LD_PRELOAD`);
> * `/etc/ld.so.preload` doesn't help since some applications are linked
>   with option `-z nodefaultlib`;
> * other applications don't use glibc and there is nothing to intercept.
>
> == The solution ==
>
> The patch provides a much more reliable in-kernel solution for the 1st
> part of the problem: binding TCP/UDP servers on the desired IP. It does not
> depend on the application environment or implementation details (whether
> glibc is used or not).




If I understand correctly, strace(1) will not show the real (after
modification by eBPF) IP/port?


What about SELinux and other LSMs?

We now have network namespaces for full isolation. Soon ILA will come.

The argument that it is not convenient (or even possible) to change the
application or to use modern isolation is quite strange, considering the
added burden/complexity/bloat to the kernel.


The post hook for sys_bind is clearly a failure of the model, since
releasing the port might already be too late: another thread might fail
to get it during a non-zero time window.

It seems this is exactly the case where a netns would be the correct answer.


If you want to provide an alternate port allocation strategy, better 
provide a correct eBPF hook.





[PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-13 Thread Alexei Starovoitov
From: Andrey Ignatov 

== The problem ==

There is a use-case when all processes inside a cgroup should use one
single IP address on a host that has multiple IPs configured.  Those
processes should use the IP for both ingress and egress, for TCP and UDP
traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. It should not require changing application
code since it's often not possible.

Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever a TCP/UDP
server calls `bind(2)`, the library replaces the IP in sockaddr before
passing arguments to the syscall. When an application calls `connect(2)` the
library transparently binds the local end of the connection to that IP
(`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid a performance penalty).

The shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
  with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.

== The solution ==

The patch provides a much more reliable in-kernel solution for the 1st
part of the problem: binding TCP/UDP servers on the desired IP. It does not
depend on the application environment or implementation details (whether
glibc is used or not).

It adds new eBPF program types `BPF_PROG_TYPE_CGROUP_INET4_BIND` and
`BPF_PROG_TYPE_CGROUP_INET6_BIND` and corresponding attach types
`BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND` (similar to already
existing `BPF_CGROUP_INET_SOCK_CREATE`).

The new program types are intended to be used with sockets (`struct sock`)
in a cgroup and with the user-provided `struct sockaddr`. Pointers to both
of them are part of the context passed to programs of the newly added types.

The new attach types provide hooks in the `bind(2)` system call for both
IPv4 and IPv6 so that one can write a program to override the IP addresses
and ports a user program tries to bind to and apply such a program to a
whole cgroup.

== Implementation notes ==

[1]
Separate prog/attach types for `AF_INET` and `AF_INET6` are added
intentionally to prevent reading/writing to offsets that don't make
sense for the corresponding socket family. E.g. if the user passes
`sockaddr_in` it doesn't make sense to read from / write to the
`user_ip6[]` context fields.
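
A sketch of what such a per-family offset check could look like, written as plain userspace C. The struct mirrors the field order of `struct bpf_sock_addr` from this patch; the enum and the function `ctx_offset_allowed` are hypothetical and only model the idea, they are not the verifier callback added by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the field order of struct bpf_sock_addr in the patch. */
struct sock_addr_layout {
    uint32_t user_family;
    uint32_t user_ip4;
    uint32_t user_ip6[4];
    uint32_t user_port;
    uint32_t family;
    uint32_t type;
    uint32_t protocol;
};
static_assert(sizeof(struct sock_addr_layout) == 40, "ten 32-bit fields expected");

enum hook_family { HOOK_INET4, HOOK_INET6 };

/* Reject context offsets that belong to the other address family. */
bool ctx_offset_allowed(enum hook_family hook, size_t off)
{
    const size_t ip4 = offsetof(struct sock_addr_layout, user_ip4);
    const size_t ip6 = offsetof(struct sock_addr_layout, user_ip6);
    const size_t ip6_end = ip6 + sizeof(((struct sock_addr_layout *)0)->user_ip6);

    if (off >= sizeof(struct sock_addr_layout))
        return false;
    if (hook == HOOK_INET4 && off >= ip6 && off < ip6_end)
        return false;
    if (hook == HOOK_INET6 && off >= ip4 && off < ip4 + sizeof(uint32_t))
        return false;
    return true;
}
```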

[2]
The write access to `struct bpf_sock_addr_kern` is implemented using
a special field as an additional "register".

There are just two registers in `sock_addr_convert_ctx_access`: `src`,
holding the value to write, and `dst`, holding a pointer to the context,
which can't be changed without breaking later instructions. But the
fields that may be written are not available directly: to access them,
the address of the corresponding pointer has to be loaded first. To get
an additional register, the 1st one not used by `src` and `dst` is taken
and its content is saved to `bpf_sock_addr_kern.tmp_reg`; the register is
then used to load the address of the pointer field, and finally the
register's content is restored from the temporary field after writing the
`src` value.
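
The spill/use/restore sequence can be emulated in userspace as follows. This is a model under stated assumptions, not the instruction sequence emitted by the patch: the register file is a plain array, `struct fake_sockaddr` and `struct ctx_kern` are simplified stand-ins, and which register counts as the "1st" free one (here: lowest index) is a guess made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

enum { NREGS = 11 };               /* eBPF registers R0..R10 */

struct fake_sockaddr { uint32_t ip4; };

struct ctx_kern {                  /* stand-in for bpf_sock_addr_kern */
    struct fake_sockaddr *uaddr;
    uint64_t tmp_reg;
};

/* Pick the first register that is neither src nor dst. */
int pick_scratch_reg(int src, int dst)
{
    int r = 0;
    while (r == src || r == dst)
        r++;
    return r;
}

/* Emulate "ctx->uaddr->ip4 = regs[src]" where regs[dst] holds the ctx pointer. */
void emulate_ctx_store(uint64_t regs[NREGS], int src, int dst)
{
    assert(src >= 0 && src < NREGS && dst >= 0 && dst < NREGS);
    struct ctx_kern *ctx = (struct ctx_kern *)(uintptr_t)regs[dst];
    int tmp = pick_scratch_reg(src, dst);

    ctx->tmp_reg = regs[tmp];                     /* 1. save scratch register */
    regs[tmp] = (uint64_t)(uintptr_t)ctx->uaddr;  /* 2. load nested pointer   */
    ((struct fake_sockaddr *)(uintptr_t)regs[tmp])->ip4 =
        (uint32_t)regs[src];                      /* 3. store the src value   */
    regs[tmp] = ctx->tmp_reg;                     /* 4. restore scratch       */
}
```

After the call, the scratch register holds its original value again, which is the property the `tmp_reg` field exists to guarantee.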

Signed-off-by: Andrey Ignatov 
Acked-by: Alexei Starovoitov 
Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf-cgroup.h |  21
 include/linux/bpf_types.h  |   2 +
 include/linux/filter.h     |  10 ++
 include/uapi/linux/bpf.h   |  24 +
 kernel/bpf/cgroup.c        |  36 +++
 kernel/bpf/syscall.c       |  14 +++
 kernel/bpf/verifier.c      |   2 +
 net/core/filter.c          | 242 +
 net/ipv4/af_inet.c         |   7 ++
 net/ipv6/af_inet6.c        |   7 ++
 10 files changed, 365 insertions(+)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 8a4566691c8f..dd0cfbddcfbe 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -6,6 +6,7 @@
 #include 
 
 struct sock;
+struct sockaddr;
 struct cgroup;
 struct sk_buff;
 struct bpf_sock_ops_kern;
@@ -63,6 +64,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 int __cgroup_bpf_run_filter_sk(struct sock *sk,
   enum bpf_attach_type type);
 
+int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
+ struct sockaddr *uaddr,
+ enum bpf_attach_type type);
+
 int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
 struct bpf_sock_ops_kern *sock_ops,
 enum bpf_attach_type type);
@@ -103,6 +108,20 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
__ret; \
 })
 
+#define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, type)   \
+({