Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
Hello, Eric. On Thu, Jan 26, 2017 at 01:45:07PM +1300, Eric W. Biederman wrote: > > Eric, does this sound okay to you? You're the authority on exposing > > things like namespace ids to users. > > *Boggle* Things that run across all network namespaces break any kind > of sense I have about thinking about them. > > Running across more than one network namespace by default seems very > broken to me. Can you explain why that is? Other namespaces don't behave this way. For example, a PID namespace doesn't hide the processes at the system level. It just gives additional nested names to the namespaced processes and having objects visible at the system level is very useful for monitoring and management. Are there inherent reasons why network namespace should be very different from other namespaces in this regard? Thanks. -- tejun
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
Andy Lutomirski writes: > On Tue, Jan 24, 2017 at 4:11 PM, Alexei Starovoitov > wrote: >> On Tue, Jan 24, 2017 at 01:24:54PM -0800, Andy Lutomirski wrote: >>> On Tue, Jan 24, 2017 at 12:29 PM, David Ahern >>> wrote: >>> > >>> > Users do not run around exec'ing commands in random network contexts >>> > (namespace, vrf, device, whatever) and expect them to just work. >>> > >>> setns() is a bit more complicated, but it should still be fine. >>> netns_install() requires CAP_NET_ADMIN over the target netns, so you >>> can only switch in to a netns if you already have privilege in that >>> netns. >> >> it's not fine. Consider our main use case which is cgroup-scoped traffic >> monitoring and enforcement for well behaving apps. > > [...] > > That does indeed sound like a sensible use case. Thanks! > >> >>> But perhaps they should be less disjoint. As far as I know, >>> cgroup+bpf is the *only* network configuration that is being set up to >>> run across all network namespaces. No one has said why this is okay, >>> let alone why it's preferable to making it work per-netns just like >>> basically all other Linux network configuration. >> >> Single cls_bpf program attached to all netdevs in tc egress >> hook can call bpf_skb_under_cgroup() to check whether given skb >> is under given cgroup and afterwards decide to drop/redirect. >> In terms of 'run across all netns' it's exactly the same. >> Our monitoring use case came out of it. >> Unfortunately bpf_skb_under_cgroup() in cls_bpf is not scalable. >> It works fine for quick on the box monitoring where user logs in >> to the box and can see what particular job is doing, >> but it's not usable when there are thousands of cgroups and >> we need to monitor them all. >> >>> I was hoping for an actual likely use case for the bpf hooks to be run >>> in all namespaces. You're arguing that iproute2 can be made to work >>> mostly okay if bpf hooks can run in all namespaces, but the use case >>> of intentionally making sk_bound_dev_if invalid across all namespaces >>> seems dubious. >> >> I think what Tejun is proposing regarding adding global_ifindex >> is a great idea and it will make ifindex interaction cleaner not only >> for cgroup+bpf, but for socket and cls_bpf programs too. >> I think it would be ok to disallow unprivileged programs (whenever >> they come) to use of bpf_sock->bound_dev_if and only >> allow bpf_sock->global_bound_dev_if and that should solve >> your security concern for future unpriv programs. >> >> I think bpf_get_sock_netns_id() helper or sk->netns_id field would >> be good addition as well, since it will allow 'ip vrf' to be smarter today. >> It's also more flexible, since bpf_type_cgroup_sock program can make >> its own decision when netns_id != expecte_id instead of hard coded. >> Today the ip vrf use case does: 'sk->bound_dev_if = idx; return OK;' >> it will be able to: >> if (sk->netns_id != expected_id) >> return DROP; >> sk->bound_dev_if = idx; >> return OK; >> or >> if (sk->netns_id != expected_id) >> return OK; >> sk->bound_dev_if = idx; >> return OK; >> or it will be able to bpf_trace_printk() or bpf_perf_event_output() >> to send notification to vrf user space daemon and so on. >> Such checks will run at the same speed as checks inside >> __cgroup_bpf_run_filter_sk(), but all other users won't pay >> for them when not in use. imo it's a win-win. > > Eric, does this sound okay to you? You're the authority on exposing > things like namespace ids to users. *Boggle* Things that run across all network namespaces break any kind of sense I have about thinking about them. Running across more than one network namespace by default seems very broken to me. ifindex is definitely a per network namespace property. We do have a netns_id in the network stack, but that is also network namespace local. It is just the local name of another network namespace. I can't imagine how any of those identifiers would make any sense in a context that was not network namespace specific. It has to be at a minimum be interpreted in the context of the network namespace things are made in. There are also some huge security and maintenance implications if there is a filter that can run acrosss all network namespaces. As it sounds like it can grant processes visibility into network namespaces they do not already have. Which can easily break things like Checkpoint/Restart as well as having rather large implications for information leaks and interference in namespaces that you a process would not otherwise have access to. When I have looked cgroups + network stack has never worked well in any use case I have seen. As cgroups are per process and that makes things like that only appropriate for very close to the process transmit decissions. AKA is it ok to send this process to a socket controlled by this PID. I don't have the full context but I am have a hard time imagining anything sensible going on. I like the subject. Let's re
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On Tue, Jan 24, 2017 at 4:11 PM, Alexei Starovoitov wrote: > On Tue, Jan 24, 2017 at 01:24:54PM -0800, Andy Lutomirski wrote: >> On Tue, Jan 24, 2017 at 12:29 PM, David Ahern >> wrote: >> > >> > Users do not run around exec'ing commands in random network contexts >> > (namespace, vrf, device, whatever) and expect them to just work. >> >> setns() is a bit more complicated, but it should still be fine. >> netns_install() requires CAP_NET_ADMIN over the target netns, so you >> can only switch in to a netns if you already have privilege in that >> netns. > > it's not fine. Consider our main use case which is cgroup-scoped traffic > monitoring and enforcement for well behaving apps. [...] That does indeed sound like a sensible use case. Thanks! > >> But perhaps they should be less disjoint. As far as I know, >> cgroup+bpf is the *only* network configuration that is being set up to >> run across all network namespaces. No one has said why this is okay, >> let alone why it's preferable to making it work per-netns just like >> basically all other Linux network configuration. > > Single cls_bpf program attached to all netdevs in tc egress > hook can call bpf_skb_under_cgroup() to check whether given skb > is under given cgroup and afterwards decide to drop/redirect. > In terms of 'run across all netns' it's exactly the same. > Our monitoring use case came out of it. > Unfortunately bpf_skb_under_cgroup() in cls_bpf is not scalable. > It works fine for quick on the box monitoring where user logs in > to the box and can see what particular job is doing, > but it's not usable when there are thousands of cgroups and > we need to monitor them all. > >> I was hoping for an actual likely use case for the bpf hooks to be run >> in all namespaces. You're arguing that iproute2 can be made to work >> mostly okay if bpf hooks can run in all namespaces, but the use case >> of intentionally making sk_bound_dev_if invalid across all namespaces >> seems dubious. > > I think what Tejun is proposing regarding adding global_ifindex > is a great idea and it will make ifindex interaction cleaner not only > for cgroup+bpf, but for socket and cls_bpf programs too. > I think it would be ok to disallow unprivileged programs (whenever > they come) to use of bpf_sock->bound_dev_if and only > allow bpf_sock->global_bound_dev_if and that should solve > your security concern for future unpriv programs. > > I think bpf_get_sock_netns_id() helper or sk->netns_id field would > be good addition as well, since it will allow 'ip vrf' to be smarter today. > It's also more flexible, since bpf_type_cgroup_sock program can make > its own decision when netns_id != expecte_id instead of hard coded. > Today the ip vrf use case does: 'sk->bound_dev_if = idx; return OK;' > it will be able to: > if (sk->netns_id != expected_id) > return DROP; > sk->bound_dev_if = idx; > return OK; > or > if (sk->netns_id != expected_id) > return OK; > sk->bound_dev_if = idx; > return OK; > or it will be able to bpf_trace_printk() or bpf_perf_event_output() > to send notification to vrf user space daemon and so on. > Such checks will run at the same speed as checks inside > __cgroup_bpf_run_filter_sk(), but all other users won't pay > for them when not in use. imo it's a win-win. Eric, does this sound okay to you? You're the authority on exposing things like namespace ids to users. > > In parallel I'm planning to work on 'disallow override' flag > for bpf_prog_attach. That should solve remaining concern. > I still disagree about urgency of implementing it, but > it's simple enough, so why not now. > Can you describe the exact semantics? I really think there should be room for adding a mode where all the relevant programs are invoked and that this should eventually be the default. It's the obvious behavior and it has many fewer footguns. I would propose the following: - By default, socket creation and egress filters call all registered programs, innermost cgroup first. If any of them return a reject code, stop processing further filters and reject. By default, ingress filters call all registered programs, innermost cgroup first. If any of them return a reject code, stop processing further filters and reject. This is indended to Do The Right Thing (TM) for both monitoring systems and for sandboxing. For monitoring, if you stick your hooks in the /a cgroup, you presumably want to see all traffic that actually goes in and out. For ingress, you want to see what's really there before it gets further modified by hooks set up by the monitored application (which Alexei noted above can run as root). For egress, you want to see what really egresses the machine, after any modifications made by BPF programs set up by the application. In particular, if the application sends a packet but then rejects it via its own hook, you *don't* want to see it because it won't actually go out on the wire and you shouldn't charge the application for it. For
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On 1/24/17 5:11 PM, Alexei Starovoitov wrote: > I think bpf_get_sock_netns_id() helper or sk->netns_id field would > be good addition as well, since it will allow 'ip vrf' to be smarter today. > It's also more flexible, since bpf_type_cgroup_sock program can make > its own decision when netns_id != expecte_id instead of hard coded. > Today the ip vrf use case does: 'sk->bound_dev_if = idx; return OK;' > it will be able to: > if (sk->netns_id != expected_id) > return DROP; > sk->bound_dev_if = idx; > return OK; > or > if (sk->netns_id != expected_id) > return OK; > sk->bound_dev_if = idx; > return OK; I agree
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On Tue, Jan 24, 2017 at 01:24:54PM -0800, Andy Lutomirski wrote: > On Tue, Jan 24, 2017 at 12:29 PM, David Ahern > wrote: > > > > Users do not run around exec'ing commands in random network contexts > > (namespace, vrf, device, whatever) and expect them to just work. > > I worked on some code once (deployed in production, even!) that calls > unshare() to make a netns and creates an interface and runs code in > there and expects it to just work. that is exactly the counter argument to your proposal which is making cgroups ignore bpf programs acting on sockets/apps that are no longer in the root ns. If app can escape cgroup by switching netns, it breaks the scoping assumption in a way that user space control plane that uses cgroup+bpf cannot oversee. The program should just work and stay visible to bpf program within that cgroup scope. More below. > setns() is a bit more complicated, but it should still be fine. > netns_install() requires CAP_NET_ADMIN over the target netns, so you > can only switch in to a netns if you already have privilege in that > netns. it's not fine. Consider our main use case which is cgroup-scoped traffic monitoring and enforcement for well behaving apps. The container management framework cannot use sandboxing and monitor all syscalls, because it's unacceptable overhead for production apps. These apps do unknown things including netns. Some of them run as root too. The container management needs to see what these apps are doing from networking point of view with lowest possible overhead. While cgroup+bpf hook is independent from netns, it's all fine. Initially the program will simply count bytes/packets and associate them with jobs where job is a collection of processes. In some cases we need to restrict whom these jobs can talk to and amount of traffic they send. Container management doesn't place jobs into netns-es today, because veth overhead is too high. Something like afnetns might be an option. Whether it's going to be netns or afnetns should be orthogonal to cgroup scoped monitoring. A job per cgroup may have multiple netns inside and the other way around. Multiple cgroup scopes within single netns. I don't think there is a case where cgroup/netns scopes will be misaligned, like: cgroup A + netns 1 and 2 cgroup B + netns 2 and 3 but I don't see a reason to restrict that either. > But perhaps they should be less disjoint. As far as I know, > cgroup+bpf is the *only* network configuration that is being set up to > run across all network namespaces. No one has said why this is okay, > let alone why it's preferable to making it work per-netns just like > basically all other Linux network configuration. Single cls_bpf program attached to all netdevs in tc egress hook can call bpf_skb_under_cgroup() to check whether given skb is under given cgroup and afterwards decide to drop/redirect. In terms of 'run across all netns' it's exactly the same. Our monitoring use case came out of it. Unfortunately bpf_skb_under_cgroup() in cls_bpf is not scalable. It works fine for quick on the box monitoring where user logs in to the box and can see what particular job is doing, but it's not usable when there are thousands of cgroups and we need to monitor them all. > I was hoping for an actual likely use case for the bpf hooks to be run > in all namespaces. You're arguing that iproute2 can be made to work > mostly okay if bpf hooks can run in all namespaces, but the use case > of intentionally making sk_bound_dev_if invalid across all namespaces > seems dubious. I think what Tejun is proposing regarding adding global_ifindex is a great idea and it will make ifindex interaction cleaner not only for cgroup+bpf, but for socket and cls_bpf programs too. I think it would be ok to disallow unprivileged programs (whenever they come) to use of bpf_sock->bound_dev_if and only allow bpf_sock->global_bound_dev_if and that should solve your security concern for future unpriv programs. I think bpf_get_sock_netns_id() helper or sk->netns_id field would be good addition as well, since it will allow 'ip vrf' to be smarter today. It's also more flexible, since bpf_type_cgroup_sock program can make its own decision when netns_id != expecte_id instead of hard coded. Today the ip vrf use case does: 'sk->bound_dev_if = idx; return OK;' it will be able to: if (sk->netns_id != expected_id) return DROP; sk->bound_dev_if = idx; return OK; or if (sk->netns_id != expected_id) return OK; sk->bound_dev_if = idx; return OK; or it will be able to bpf_trace_printk() or bpf_perf_event_output() to send notification to vrf user space daemon and so on. Such checks will run at the same speed as checks inside __cgroup_bpf_run_filter_sk(), but all other users won't pay for them when not in use. imo it's a win-win. In parallel I'm planning to work on 'disallow override' flag for bpf_prog_attach. That should solve remaining concern. I still disagree about urgency of implementing it, but it's simple en
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
Hello, Andy. On Tue, Jan 24, 2017 at 10:54:03AM -0800, Andy Lutomirski wrote: > Tejun, I can see two basic ways that cgroup+bpf delegation could work > down the road. Both depend on unprivileged users being able to load > BPF programs of the correct type, but that's mostly a matter of > auditing the code and doesn't have any particularly interesting design > challenges as far as I know. > > Approach 1: Scope cgroup+bpf hooks to a netns and require > ns_capable(CAP_NET_ADMIN), fs permissions, and (optionally) delegation > to be enabled to install them. This shouldn't have any particularly > dangerous security implications because ns_capable(CAP_NET_ADMIN) > means you already own the netns. > > Approach 2: Keep cgroup+bpf hooks global. This makes the delegation > story much trickier. There needs to be some mechanism to prevent some > program that has delegated cgroup control from affecting the behavior > of outside programs. This isn't so easy. Imagine that you've > delegated /cgroup/foo to UID 1000. A program with UID 1000 can now > install a hook on /cgroup/foo, but there needs to be a mechanism to > prevent UID 1000 from running in /cgroup/foo in the global netns and > then running a tool like sudo. This *might* be possible using just > existing cgroup mechanisms, but it's going to be quite tricky to make > sure it's done completely correctly and without dangerous races. > > If cgroup+bpf stays global, then I think you're basically committing > to approach 2. > > I would suggest doing approach 1 (i.e. apply my patch) and then, if > truly needed for some use case, add an option so that globally > privileged programs can create hooks that affect all namespaces. I thought about restricting according to namespaces and your ifindex example. While the ifindex issue seems broken and on the surface seems to indicate that we need to segregate cgroup bpf programs per netns as iptables does, the more I think about it, the less convinced I'm that that is the only right way forward. I'm not too familiar with netns but if you think about other namespaces, !root namespaces nest inside the root one. When you enter a pid namespace, it adds its pid namespace on top of the global one. It doesn't replace that, and that is an important characteristic which allows the root namespace to maintain control over the nested ones. This is the same with cgroupns, which basically is just nesting, or mount namespace. Namespaced objects don't disappear in the eyes of the root namespace. They just get additional scoped names. Back to netns, if I'm understanding correctly, this doesn't seem to be the case. An interface belongs to a namespace and can be moved around but it disappears from the root namespace once it enters a non-root one. I guess there are reasons, even if historical, for it but this is what is breaking your ifindex example. It's fine for an interface to get a new scoped ifindex inside a netns, but that shouldn't override the global one. This, I think, should be fixable without breaking existing APIs by introducing a new global index. My opinion on this isn't very strong but what iptables is doing actually seems wrong to me in that it actually precludes any way of implementing global policies which encompasses the whole system. Again, this is a fixable problem if ever necessary by introducing properly global rule tables or even just adding a flag. So, I don't think the situation is as clear as "ifindex is broken, so cgroup bpf programs should be segregated per netns". David, in another reply, you mentioned about fixing ifindex. If I understood correctly, we're talking about the same thing, right? If so, is this something easy to implement? Thanks. -- tejun
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On 1/24/17 2:24 PM, Andy Lutomirski wrote: > I was hoping for an actual likely use case for the bpf hooks to be run > in all namespaces. You're arguing that iproute2 can be made to work > mostly okay if bpf hooks can run in all namespaces, but the use case > of intentionally making sk_bound_dev_if invalid across all namespaces > seems dubious. you can use the bpf hook to deny socket create based on family and/or protocol.
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On Tue, Jan 24, 2017 at 12:29 PM, David Ahern wrote: > > Users do not run around exec'ing commands in random network contexts > (namespace, vrf, device, whatever) and expect them to just work. I worked on some code once (deployed in production, even!) that calls unshare() to make a netns and creates an interface and runs code in there and expects it to just work. It wouldn't work the outer program were run under current ip vrf. > >> >> Maybe you can argue that this is a missing feature in cgroup+bpf (no >> API to query which netns is in use) and a corresponding bug in 'ip >> vrf', but I see this as evidence that cgroup+bpf as it exists in 4.10 >> is not carefully enough throught through. The only non-example user >> of it that I can find (ip vrf) is buggy and can't really be fixed >> using mechanisms that exist in 4.10-rc. > > The argument is that cgroups and namespaces are completely disjoint > infrastructure and that users need to know what they are doing. But perhaps they should be less disjoint. As far as I know, cgroup+bpf is the *only* network configuration that is being set up to run across all network namespaces. No one has said why this is okay, let alone why it's preferable to making it work per-netns just like basically all other Linux network configuration. > >> >>> things up so that unshare will malfunction. It should avoid malfunctioning when running Linux programs that are unaware of it. >>> >>> I agree that for VRF use case it will help to make programs netns >>> aware by adding new bpf_get_current_netns_id() or something helper, >>> but it's up to the program to function properly or be broken. >> >> This will cause David's code to run slower, and I think he wants very >> high performance. > > This is a socket create path not a packet path. While overhead should be > contained, a few extra cycles should be fine. > > Adding the capability to allow users to check the netns id would offer a > solution to the namespace problem, but there is nothing that *requires* a bpf > program to do it. > > Who's to say an admin does not *want* all processes in a group to have > sockets bound to a non-existent device? 'ip vrf' restricts the device index > to a VRF device because as a management tool I want it to be user friendly, > but generically the BPF code does not have any restrictions. ifindex can be > any u32 value. I was hoping for an actual likely use case for the bpf hooks to be run in all namespaces. You're arguing that iproute2 can be made to work mostly okay if bpf hooks can run in all namespaces, but the use case of intentionally making sk_bound_dev_if invalid across all namespaces seems dubious. But all of what you're suggesting doing would still work fine and would result in less kernel code *and* less eBPF code if the hooks were per netns. --Andy
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On 1/24/17 11:54 AM, Andy Lutomirski wrote: >>> My point is that, in 4.10-rc, it doesn't work right, and I doubt this >>> problem is restricted to just 'ip vrf'. Without some kind of change >>> to the way that netns and cgroup+bpf interact, anything that uses >>> sk_bound_dev_if or reads the ifindex on an skb will be subject to a >>> huge footgun that unprivileged programs can trigger and any future >>> attempt to make the cgroup+bpf work for unprivileged users is going to >>> be more complicated than it deserves to be. >> >> For n-th time, the current BPF_PROG_TYPE_CGROUP* is root only and >> speculation about unprivileged usage are not helping the discussion. > > ...which has nothing to do with my example of how it's broken. I used > 'ip vrf' the way it was intended to be used and then I ran code in it > that uses only APIs that predate eBPF. The result was that the > program obtained a broken socket that had sk_bound_dev_if filled in > with a nonsense index. There's no speculation -- it's just broken. Again, there are many ways to arrive at this point where a socket is bound to a device index that is invalid. Nothing stops you from downloading a bpf program with an invalid ifindex. That is up to the user. Users do not run around exec'ing commands in random network contexts (namespace, vrf, device, whatever) and expect them to just work. > > Maybe you can argue that this is a missing feature in cgroup+bpf (no > API to query which netns is in use) and a corresponding bug in 'ip > vrf', but I see this as evidence that cgroup+bpf as it exists in 4.10 > is not carefully enough throught through. The only non-example user > of it that I can find (ip vrf) is buggy and can't really be fixed > using mechanisms that exist in 4.10-rc. The argument is that cgroups and namespaces are completely disjoint infrastructure and that users need to know what they are doing. If one uses a management tool (ip in this case) in a consistent manner it will work. If 'ip netns' is used to change namespaces, vrf is reset on the switch. And, I have a patch now to properly nest namespaces and vrfs so that mgmt vrf in multiple namespaces will work properly. > >> >>> things up so that unshare will malfunction. It should avoid >>> malfunctioning when running Linux programs that are unaware of it. >> >> I agree that for VRF use case it will help to make programs netns >> aware by adding new bpf_get_current_netns_id() or something helper, >> but it's up to the program to function properly or be broken. > > This will cause David's code to run slower, and I think he wants very > high performance. This is a socket create path not a packet path. While overhead should be contained, a few extra cycles should be fine. Adding the capability to allow users to check the netns id would offer a solution to the namespace problem, but there is nothing that *requires* a bpf program to do it. Who's to say an admin does not *want* all processes in a group to have sockets bound to a non-existent device? 'ip vrf' restricts the device index to a VRF device because as a management tool I want it to be user friendly, but generically the BPF code does not have any restrictions. ifindex can be any u32 value.
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On Tue, Jan 24, 2017 at 9:48 AM, Alexei Starovoitov wrote: > On Mon, Jan 23, 2017 at 08:32:02PM -0800, Andy Lutomirski wrote: >> On Mon, Jan 23, 2017 at 8:05 PM, David Ahern >> wrote: >> > On 1/23/17 8:37 PM, Andy Lutomirski wrote: >> >> Yes, it is a bug because cgroup+bpf causes unwitting programs to be >> >> subject to BPF code installed by a different, potentially unrelated >> >> process. That's a new situation. The failure can happen when a >> >> privileged supervisor (whoever runs ip vrf) runs a clueless or >> >> unprivileged program (the thing calling unshare()). >> > >> > There are many, many ways to misconfigure networking and to run programs >> > in a context or with an input argument that causes the program to not work >> > at all, not work as expected or stop working. This situation is no >> > different. >> > >> > For example, the only aspect of BPF_PROG_TYPE_CGROUP_SOCK filters that are >> > namespace based is the ifindex. You brought up the example of changing >> > namespaces where the ifindex is not defined. Alexei mentioned an example >> > where interfaces can be moved to another namespace breaking any ifindex >> > based programs. Another example is the interface can be deleted. Deleting >> > an interface with sockets bound to it does not impact the program in any >> > way - no notification, no wakeup, nothing. The sockets just don't work. >> >> And if you use 'ip vrf' to bind to a vrf with ifindex 4 and a program >> unshares netns and creates an interface with ifindex 4, then that >> program will end up with its sockets magically bound to ifindex 4 and >> will silently malfunction. >> >> I can think of multiple ways to address this problem. You could scope >> the hooks to a netns (which is sort of what my patch does). You could >> find a way to force programs in a given cgroup to only execute in a >> single netns, although that would probably cause other breakage. You >> could improve the BPF hook API to be netns-aware, which could plausbly >> address issues related to unshare() but might get very tricky when >> setns() is involved. > > scoping cgroup to netns will create weird combination of cgroup and netns > which was never done before. cgroup and netns scopes have to be able > to overlap in arbitrary way. Application shouldn't not be able to escape > cgroup scope by changing netns. I had assumed that too, but I'm not longer at all convinced that this is a problem. It's certainly the case that, if you put an application into a restrictive cgroup, it shouldn't be able to bypass those restrictions using unshare() or setns(). But I don't think this is really possible regardless. The easy way to try to escape is using unshare(), but this doesn't actually buy you anything. $ unshare -Urn ip link 1: lo: mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 Maybe I just escaped the cgroup restrictions on ingress and egress, but that doesn't seem to matter because I also no longer have any interfaces that have any traffic. And, if I created a socket before unsharing, it's still safe because that socket is bound to the old netns and would still be subject to the cgroup rules with my patch applied. setns() is a bit more complicated, but it should still be fine. netns_install() requires CAP_NET_ADMIN over the target netns, so you can only switch in to a netns if you already have privilege in that netns. >> My point is that, in 4.10-rc, it doesn't work right, and I doubt this >> problem is restricted to just 'ip vrf'. Without some kind of change >> to the way that netns and cgroup+bpf interact, anything that uses >> sk_bound_dev_if or reads the ifindex on an skb will be subject to a >> huge footgun that unprivileged programs can trigger and any future >> attempt to make the cgroup+bpf work for unprivileged users is going to >> be more complicated than it deserves to be. > > For n-th time, the current BPF_PROG_TYPE_CGROUP* is root only and > speculation about unprivileged usage are not helping the discussion. ...which has nothing to do with my example of how it's broken. I used 'ip vrf' the way it was intended to be used and then I ran code in it that uses only APIs that predate eBPF. The result was that the program obtained a broken socket that had sk_bound_dev_if filled in with a nonsense index. There's no speculation -- it's just broken. Maybe you can argue that this is a missing feature in cgroup+bpf (no API to query which netns is in use) and a corresponding bug in 'ip vrf', but I see this as evidence that cgroup+bpf as it exists in 4.10 is not carefully enough throught through. The only non-example user of it that I can find (ip vrf) is buggy and can't really be fixed using mechanisms that exist in 4.10-rc. > >> things up so that unshare will malfunction. It should avoid >> malfunctioning when running Linux programs that are unaware of it. > > I agree that for VRF use case it will help t
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On Mon, Jan 23, 2017 at 08:32:02PM -0800, Andy Lutomirski wrote: > On Mon, Jan 23, 2017 at 8:05 PM, David Ahern wrote: > > On 1/23/17 8:37 PM, Andy Lutomirski wrote: > >> Yes, it is a bug because cgroup+bpf causes unwitting programs to be > >> subject to BPF code installed by a different, potentially unrelated > >> process. That's a new situation. The failure can happen when a > >> privileged supervisor (whoever runs ip vrf) runs a clueless or > >> unprivileged program (the thing calling unshare()). > > > > There are many, many ways to misconfigure networking and to run programs in > > a context or with an input argument that causes the program to not work at > > all, not work as expected or stop working. This situation is no different. > > > > For example, the only aspect of BPF_PROG_TYPE_CGROUP_SOCK filters that are > > namespace based is the ifindex. You brought up the example of changing > > namespaces where the ifindex is not defined. Alexei mentioned an example > > where interfaces can be moved to another namespace breaking any ifindex > > based programs. Another example is the interface can be deleted. Deleting > > an interface with sockets bound to it does not impact the program in any > > way - no notification, no wakeup, nothing. The sockets just don't work. > > And if you use 'ip vrf' to bind to a vrf with ifindex 4 and a program > unshares netns and creates an interface with ifindex 4, then that > program will end up with its sockets magically bound to ifindex 4 and > will silently malfunction. > > I can think of multiple ways to address this problem. You could scope > the hooks to a netns (which is sort of what my patch does). You could > find a way to force programs in a given cgroup to only execute in a > single netns, although that would probably cause other breakage. You > could improve the BPF hook API to be netns-aware, which could plausbly > address issues related to unshare() but might get very tricky when > setns() is involved. scoping cgroup to netns will create weird combination of cgroup and netns which was never done before. cgroup and netns scopes have to be able to overlap in arbitrary way. Application shouldn't not be able to escape cgroup scope by changing netns. They are orthogonal scoping constructs. What we can do instead is to do add bpf helper function to retrieve the netns id, so that program created for specific netns and specific ifindex will be more resilient to netns changes. Note that this ifindex behavior has been present since classic bpf and its shortcomings are understood. We need to improve it for all bpf users. It really has nothing to do with cgroups. cls_bpf filters can do packet redirect into ifindex and they can break if user process changes. The 'malfunction' scenario describe above is no different than cls_bpf is using bpf_skb_under_cgroup() to scope particular skb->sk->cgroup and doing redirect into ifindex. It will 'malfunction' exactly the same way if application stays in the same cgroup, but moves from one netns into another. > My point is that, in 4.10-rc, it doesn't work right, and I doubt this > problem is restricted to just 'ip vrf'. Without some kind of change > to the way that netns and cgroup+bpf interact, anything that uses > sk_bound_dev_if or reads the ifindex on an skb will be subject to a > huge footgun that unprivileged programs can trigger and any future > attempt to make the cgroup+bpf work for unprivileged users is going to > be more complicated than it deserves to be. For n-th time, the current BPF_PROG_TYPE_CGROUP* is root only and speculation about unprivileged usage are not helping the discussion. > things up so that unshare will malfunction. It should avoid > malfunctioning when running Linux programs that are unaware of it. I agree that for VRF use case it will help to make programs netns aware by adding new bpf_get_current_netns_id() or something helper, but it's up to the program to function properly or be broken. Adding weird netns restrictions is not going to make bpf programs any more sane. The program author can always shoot in the foot. I will work on the patch to add that.
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On Mon, Jan 23, 2017 at 8:05 PM, David Ahern wrote: > On 1/23/17 8:37 PM, Andy Lutomirski wrote: >> Yes, it is a bug because cgroup+bpf causes unwitting programs to be >> subject to BPF code installed by a different, potentially unrelated >> process. That's a new situation. The failure can happen when a >> privileged supervisor (whoever runs ip vrf) runs a clueless or >> unprivileged program (the thing calling unshare()). > > There are many, many ways to misconfigure networking and to run programs in a > context or with an input argument that causes the program to not work at all, > not work as expected or stop working. This situation is no different. > > For example, the only aspect of BPF_PROG_TYPE_CGROUP_SOCK filters that are > namespace based is the ifindex. You brought up the example of changing > namespaces where the ifindex is not defined. Alexei mentioned an example > where interfaces can be moved to another namespace breaking any ifindex based > programs. Another example is the interface can be deleted. Deleting an > interface with sockets bound to it does not impact the program in any way - > no notification, no wakeup, nothing. The sockets just don't work. And if you use 'ip vrf' to bind to a vrf with ifindex 4 and a program unshares netns and creates an interface with ifindex 4, then that program will end up with its sockets magically bound to ifindex 4 and will silently malfunction. I can think of multiple ways to address this problem. You could scope the hooks to a netns (which is sort of what my patch does). You could find a way to force programs in a given cgroup to only execute in a single netns, although that would probably cause other breakage. You could improve the BPF hook API to be netns-aware, which could plausbly address issues related to unshare() but might get very tricky when setns() is involved. My point is that, in 4.10-rc, it doesn't work right, and I doubt this problem is restricted to just 'ip vrf'. Without some kind of change to the way that netns and cgroup+bpf interact, anything that uses sk_bound_dev_if or reads the ifindex on an skb will be subject to a huge footgun that unprivileged programs can trigger and any future attempt to make the cgroup+bpf work for unprivileged users is going to be more complicated than it deserves to be. (Aside: I wonder if a similar goal to 'ip vrf' could be achieved by moving the vrf to a different network namespace and putting programs that you want to be subject to the VRF into that namespace.) > > The point of using a management tool like ip is to handle the details to make > things sane -- which is the consistent with the whole point of ip netns vs > running unshare -n. I still don't understand how you're going to make 'ip netns' make this problem go away. Keep in mind that there are real programs that call the unshare() syscall directly. > >> >> If the way cgroup+bpf worked was that a program did a prctl() or >> similar to opt in to being managed by a provided cgroup-aware BPF >> program, then I'd agree with everything you're saying. But that's not >> at all how this code works. > > I don't want an opt-in approach. The program does not need to know or even > care that it has some restriction. Today, someone runs 'ip netns exec NAME > COMMAND' the command does not need to know or care that the network namespace > was changed on it. Same with 'ip vrf exec' -- it is a network policy that is > forced on the program by the user. I absolutely agree. My point is that cgroup+bpf is *not* opt-in, so the bar should be higher. You can't blame the evil program that called unshare() for breaking things when 'ip vrf' unavoidably sets things up so that unshare will malfunction. It should avoid malfunctioning when running Linux programs that are unaware of it. This should include programs like unshare and 'ip netns'.
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On Mon, Jan 23, 2017 at 8:10 PM, David Ahern wrote: > On 1/23/17 7:39 PM, Andy Lutomirski wrote: >> I'm not sure I followed what you meant. If I understood right (which >> is a big if) you're saying that ip vrf when run in a netns works >> correctly in that netns. I agree, but I think that it would continue >> to work (even more reliably) if the hooks only executed in the netns >> in which they were created. I think that would probably be a good >> improvement, and it could be done on top of my patch, but I'm not >> about to write that code for 4.10. > > Sounds like an efficiency on the implementation that does not require > limiting the code today to just init namespace. > It's an ABI change, not an implementation change. If someone wants to make the code fully netns-aware in time for 4.10, that sounds great. I'm not going to do that.
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On 1/23/17 8:37 PM, Andy Lutomirski wrote: > Yes, it is a bug because cgroup+bpf causes unwitting programs to be > subject to BPF code installed by a different, potentially unrelated > process. That's a new situation. The failure can happen when a > privileged supervisor (whoever runs ip vrf) runs a clueless or > unprivileged program (the thing calling unshare()). There are many, many ways to misconfigure networking and to run programs in a context or with an input argument that causes the program to not work at all, not work as expected or stop working. This situation is no different. For example, the only aspect of BPF_PROG_TYPE_CGROUP_SOCK filters that are namespace based is the ifindex. You brought up the example of changing namespaces where the ifindex is not defined. Alexei mentioned an example where interfaces can be moved to another namespace breaking any ifindex based programs. Another example is the interface can be deleted. Deleting an interface with sockets bound to it does not impact the program in any way - no notification, no wakeup, nothing. The sockets just don't work. The point of using a management tool like ip is to handle the details to make things sane -- which is the consistent with the whole point of ip netns vs running unshare -n. > > If the way cgroup+bpf worked was that a program did a prctl() or > similar to opt in to being managed by a provided cgroup-aware BPF > program, then I'd agree with everything you're saying. But that's not > at all how this code works. I don't want an opt-in approach. The program does not need to know or even care that it has some restriction. Today, someone runs 'ip netns exec NAME COMMAND' the command does not need to know or care that the network namespace was changed on it. Same with 'ip vrf exec' -- it is a network policy that is forced on the program by the user.
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On 1/23/17 7:39 PM, Andy Lutomirski wrote: > I'm not sure I followed what you meant. If I understood right (which > is a big if) you're saying that ip vrf when run in a netns works > correctly in that netns. I agree, but I think that it would continue > to work (even more reliably) if the hooks only executed in the netns > in which they were created. I think that would probably be a good > improvement, and it could be done on top of my patch, but I'm not > about to write that code for 4.10. Sounds like an efficiency on the implementation that does not require limiting the code today to just init namespace.
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On Mon, Jan 23, 2017 at 7:13 PM, Alexei Starovoitov wrote: > On Mon, Jan 23, 2017 at 06:42:27PM -0800, Andy Lutomirski wrote: >> Please explain how the change results in a broken ABI and how the >> current ABI is better. I gave a fully worked out example of how the >> current ABI misbehaves using only standard Linux networking tools. > > I gave an example in the other thread that demonstrated > cgroup escape by the appliction when it changes netns. > If application became part of cgroup, it has to stay there, > no matter setns() calls afterwards. The other thread is out of control. Can you restate your example? Please keep in mind that uprivileged programs can unshare their netns. > >> > >> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >> >> index e89acea22ecf..c0bbc55e244d 100644 >> >> --- a/kernel/bpf/syscall.c >> >> +++ b/kernel/bpf/syscall.c >> >> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr >> >> *attr) >> >> struct cgroup *cgrp; >> >> enum bpf_prog_type ptype; >> >> >> >> + /* >> >> + * For now, socket bpf hooks attached to cgroups can only be >> >> + * installed in the init netns and only affect the init netns. >> >> + * This could be relaxed in the future once some semantic issues >> >> + * are resolved. For example, ifindexes belonging to one netns >> >> + * should probably not be visible to hooks installed by programs >> >> + * running in a different netns. >> > >> > the comment is bogus and shows complete lack of understanding >> > how bpf is working and how it's being used. >> > See SKF_AD_IFINDEX that was in classic bpf forever. >> > >> >> I think I disagree completely. >> >> With classic BPF, a program creates a socket and binds a bpf program >> to it. That program can see the ifindex, which is fine because that >> ifindex is scoped to the socket's netns, which is (unless the caller >> uses setns() or unshare()) the same as the caller's netns, so the >> ifindex means exactly what the caller thinks it means. >> >> With cgroup+bpf, the program installing the hook can be completely >> unrelated to the program whose sockets are subject to the hook, and, >> if they're using different namespaces, it breaks. > > Please also see ingress_ifindex, ifindex, bpf_redirect(), bpf_clone_redirect() > that all operate on the ifindex while the program can be detached from > the application. In general bpf program and application that loaded and > attached it are mostly disjoint in case of networking. We have tail_call > mechanism and master bpf prog may call programs installed by other apps > that may have exited already. If program A acquires a BPF object from program B where program B runs in a different netns from program A, and program B uses or tail-calls that BPF object, then A is either doing it intentionally and is netns-aware or it is terminally buggy and deserves what it gets. > cls_bpf is scoped by netdev that belongs to some netns. > If after attaching a program with hardcoded if(ifindex==3) check > to such netdev, this netdev is moved into different netns, this 'if' check > and the program become bogus and won't do what program author > expected. It is expected behavior. The same thing with current 'ip vrf' > implementation that Dave is adjusting. It's not a bug in cgroup+bpf behavior. > Yes, it is a bug because cgroup+bpf causes unwitting programs to be subject to BPF code installed by a different, potentially unrelated process. That's a new situation. The failure can happen when a privileged supervisor (whoever runs ip vrf) runs a clueless or unprivileged program (the thing calling unshare()). If the way cgroup+bpf worked was that a program did a prctl() or similar to opt in to being managed by a provided cgroup-aware BPF program, then I'd agree with everything you're saying. But that's not at all how this code works.
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On Mon, Jan 23, 2017 at 06:42:27PM -0800, Andy Lutomirski wrote: > On Mon, Jan 23, 2017 at 6:09 PM, Alexei Starovoitov > wrote: > > On Mon, Jan 23, 2017 at 12:36:08PM -0800, Andy Lutomirski wrote: > >> To see how cgroup+bpf interacts with network namespaces, I wrote a > >> little program called show_bind that calls getsockopt(..., > >> SO_BINDTODEVICE, ...) and prints the result. It did this: > >> > >> # ./ip link add dev vrf0 type vrf table 10 > >> # ./ip vrf exec vrf0 ./show_bind > >> Default binding is "vrf0" > >> # ./ip vrf exec vrf0 unshare -n ./show_bind > >> show_bind: getsockopt: No such device > >> > >> What's happening here is that "ip vrf" looks up vrf0's ifindex in > >> the init netns and installs a hook that binds sockets to that > >> ifindex. When the hook runs in a different netns, it sets > >> sk_bound_dev_if to an ifindex from the wrong netns, resulting in > >> incorrect behavior. In this particular example, the ifindex was 4 > >> and there was no ifindex 4 in the new netns. If there had been, > >> this test would have malfunctioned differently > >> > >> Since it's rather late in the release cycle, let's punt. This patch > >> makes it impossible to install cgroup+bpf hooks outside the init > >> netns and makes them not run on sockets that aren't in the init > >> netns. > >> > >> In a future release, it should be relatively straightforward to make > >> these hooks be local to a netns and, if needed, to add a flag so > >> that hooks can be made global if necessary. Global hooks should > >> presumably be constrained so that they can't write to any ifindex > >> fields. > >> > >> Cc: Daniel Borkmann > >> Cc: Alexei Starovoitov > >> Cc: David Ahern > >> Signed-off-by: Andy Lutomirski > >> --- > >> > >> DaveM, this mitigates a bug in a feature that's new in 4.10, and the > >> bug can be hit using current iproute2 -git. please consider this for > >> -net. > >> > >> Changes from v1: > >> - Fix the commit message. 'git commit' was very clever and thought that > >>all the interesting bits of the test case were intended to be comments > >>and stripped them. Whoops! > >> > >> kernel/bpf/cgroup.c | 21 + > >> kernel/bpf/syscall.c | 11 +++ > >> 2 files changed, 32 insertions(+) > >> > >> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > >> index a515f7b007c6..a824f543de69 100644 > >> --- a/kernel/bpf/cgroup.c > >> +++ b/kernel/bpf/cgroup.c > >> @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk, > >> if (!sk || !sk_fullsock(sk)) > >> return 0; > >> > >> + /* > >> + * For now, socket bpf hooks attached to cgroups can only be > >> + * installed in the init netns and only affect the init netns. > >> + * This could be relaxed in the future once some semantic issues > >> + * are resolved. For example, ifindexes belonging to one netns > >> + * should probably not be visible to hooks installed by programs > >> + * running in a different netns. > >> + */ > >> + if (sock_net(sk) != &init_net) > >> + return 0; > > > > Nack. > > Such check will create absolutely broken abi that we won't be able to fix > > later. > > Please explain how the change results in a broken ABI and how the > current ABI is better. I gave a fully worked out example of how the > current ABI misbehaves using only standard Linux networking tools. I gave an example in the other thread that demonstrated cgroup escape by the appliction when it changes netns. If application became part of cgroup, it has to stay there, no matter setns() calls afterwards. > > > >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > >> index e89acea22ecf..c0bbc55e244d 100644 > >> --- a/kernel/bpf/syscall.c > >> +++ b/kernel/bpf/syscall.c > >> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr) > >> struct cgroup *cgrp; > >> enum bpf_prog_type ptype; > >> > >> + /* > >> + * For now, socket bpf hooks attached to cgroups can only be > >> + * installed in the init netns and only affect the init netns. > >> + * This could be relaxed in the future once some semantic issues > >> + * are resolved. For example, ifindexes belonging to one netns > >> + * should probably not be visible to hooks installed by programs > >> + * running in a different netns. > > > > the comment is bogus and shows complete lack of understanding > > how bpf is working and how it's being used. > > See SKF_AD_IFINDEX that was in classic bpf forever. > > > > I think I disagree completely. > > With classic BPF, a program creates a socket and binds a bpf program > to it. That program can see the ifindex, which is fine because that > ifindex is scoped to the socket's netns, which is (unless the caller > uses setns() or unshare()) the same as the caller's netns, so the > ifindex means exactly what the caller thinks it means. > > With cgroup+bpf, the program installing
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On Mon, Jan 23, 2017 at 6:09 PM, Alexei Starovoitov wrote: > On Mon, Jan 23, 2017 at 12:36:08PM -0800, Andy Lutomirski wrote: >> To see how cgroup+bpf interacts with network namespaces, I wrote a >> little program called show_bind that calls getsockopt(..., >> SO_BINDTODEVICE, ...) and prints the result. It did this: >> >> # ./ip link add dev vrf0 type vrf table 10 >> # ./ip vrf exec vrf0 ./show_bind >> Default binding is "vrf0" >> # ./ip vrf exec vrf0 unshare -n ./show_bind >> show_bind: getsockopt: No such device >> >> What's happening here is that "ip vrf" looks up vrf0's ifindex in >> the init netns and installs a hook that binds sockets to that >> ifindex. When the hook runs in a different netns, it sets >> sk_bound_dev_if to an ifindex from the wrong netns, resulting in >> incorrect behavior. In this particular example, the ifindex was 4 >> and there was no ifindex 4 in the new netns. If there had been, >> this test would have malfunctioned differently >> >> Since it's rather late in the release cycle, let's punt. This patch >> makes it impossible to install cgroup+bpf hooks outside the init >> netns and makes them not run on sockets that aren't in the init >> netns. >> >> In a future release, it should be relatively straightforward to make >> these hooks be local to a netns and, if needed, to add a flag so >> that hooks can be made global if necessary. Global hooks should >> presumably be constrained so that they can't write to any ifindex >> fields. >> >> Cc: Daniel Borkmann >> Cc: Alexei Starovoitov >> Cc: David Ahern >> Signed-off-by: Andy Lutomirski >> --- >> >> DaveM, this mitigates a bug in a feature that's new in 4.10, and the >> bug can be hit using current iproute2 -git. please consider this for >> -net. >> >> Changes from v1: >> - Fix the commit message. 'git commit' was very clever and thought that >>all the interesting bits of the test case were intended to be comments >>and stripped them. Whoops! >> >> kernel/bpf/cgroup.c | 21 + >> kernel/bpf/syscall.c | 11 +++ >> 2 files changed, 32 insertions(+) >> >> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c >> index a515f7b007c6..a824f543de69 100644 >> --- a/kernel/bpf/cgroup.c >> +++ b/kernel/bpf/cgroup.c >> @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk, >> if (!sk || !sk_fullsock(sk)) >> return 0; >> >> + /* >> + * For now, socket bpf hooks attached to cgroups can only be >> + * installed in the init netns and only affect the init netns. >> + * This could be relaxed in the future once some semantic issues >> + * are resolved. For example, ifindexes belonging to one netns >> + * should probably not be visible to hooks installed by programs >> + * running in a different netns. >> + */ >> + if (sock_net(sk) != &init_net) >> + return 0; > > Nack. > Such check will create absolutely broken abi that we won't be able to fix > later. Please explain how the change results in a broken ABI and how the current ABI is better. I gave a fully worked out example of how the current ABI misbehaves using only standard Linux networking tools. > >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >> index e89acea22ecf..c0bbc55e244d 100644 >> --- a/kernel/bpf/syscall.c >> +++ b/kernel/bpf/syscall.c >> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr) >> struct cgroup *cgrp; >> enum bpf_prog_type ptype; >> >> + /* >> + * For now, socket bpf hooks attached to cgroups can only be >> + * installed in the init netns and only affect the init netns. >> + * This could be relaxed in the future once some semantic issues >> + * are resolved. For example, ifindexes belonging to one netns >> + * should probably not be visible to hooks installed by programs >> + * running in a different netns. > > the comment is bogus and shows complete lack of understanding > how bpf is working and how it's being used. > See SKF_AD_IFINDEX that was in classic bpf forever. > I think I disagree completely. With classic BPF, a program creates a socket and binds a bpf program to it. That program can see the ifindex, which is fine because that ifindex is scoped to the socket's netns, which is (unless the caller uses setns() or unshare()) the same as the caller's netns, so the ifindex means exactly what the caller thinks it means. With cgroup+bpf, the program installing the hook can be completely unrelated to the program whose sockets are subject to the hook, and, if they're using different namespaces, it breaks. --Andy
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On Mon, Jan 23, 2017 at 6:31 PM, David Ahern wrote: > On 1/23/17 7:09 PM, Alexei Starovoitov wrote: >>> + */ >>> +if (current->nsproxy->net_ns != &init_net) >>> +return -EINVAL; >> >> this restriction I actually don't mind, since it indeed can be >> relaxed later, but please make it proper with net_eq() >> > > I do mind. Why have different restrictions for the skb and sk filters? > > I have already shown that ip can alleviate the management aspects of the > implementation -- just like ip netns does. I'm not sure I followed what you meant. If I understood right (which is a big if) you're saying that ip vrf when run in a netns works correctly in that netns. I agree, but I think that it would continue to work (even more reliably) if the hooks only executed in the netns in which they were created. I think that would probably be a good improvement, and it could be done on top of my patch, but I'm not about to write that code for 4.10. --Andy
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On 1/23/17 7:09 PM, Alexei Starovoitov wrote: >> + */ >> +if (current->nsproxy->net_ns != &init_net) >> +return -EINVAL; > > this restriction I actually don't mind, since it indeed can be > relaxed later, but please make it proper with net_eq() > I do mind. Why have different restrictions for the skb and sk filters? I have already shown that ip can alleviate the management aspects of the implementation -- just like ip netns does.
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On Mon, Jan 23, 2017 at 12:36:08PM -0800, Andy Lutomirski wrote: > To see how cgroup+bpf interacts with network namespaces, I wrote a > little program called show_bind that calls getsockopt(..., > SO_BINDTODEVICE, ...) and prints the result. It did this: > > # ./ip link add dev vrf0 type vrf table 10 > # ./ip vrf exec vrf0 ./show_bind > Default binding is "vrf0" > # ./ip vrf exec vrf0 unshare -n ./show_bind > show_bind: getsockopt: No such device > > What's happening here is that "ip vrf" looks up vrf0's ifindex in > the init netns and installs a hook that binds sockets to that > ifindex. When the hook runs in a different netns, it sets > sk_bound_dev_if to an ifindex from the wrong netns, resulting in > incorrect behavior. In this particular example, the ifindex was 4 > and there was no ifindex 4 in the new netns. If there had been, > this test would have malfunctioned differently > > Since it's rather late in the release cycle, let's punt. This patch > makes it impossible to install cgroup+bpf hooks outside the init > netns and makes them not run on sockets that aren't in the init > netns. > > In a future release, it should be relatively straightforward to make > these hooks be local to a netns and, if needed, to add a flag so > that hooks can be made global if necessary. Global hooks should > presumably be constrained so that they can't write to any ifindex > fields. > > Cc: Daniel Borkmann > Cc: Alexei Starovoitov > Cc: David Ahern > Signed-off-by: Andy Lutomirski > --- > > DaveM, this mitigates a bug in a feature that's new in 4.10, and the > bug can be hit using current iproute2 -git. please consider this for > -net. > > Changes from v1: > - Fix the commit message. 'git commit' was very clever and thought that >all the interesting bits of the test case were intended to be comments >and stripped them. Whoops! > > kernel/bpf/cgroup.c | 21 + > kernel/bpf/syscall.c | 11 +++ > 2 files changed, 32 insertions(+) > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > index a515f7b007c6..a824f543de69 100644 > --- a/kernel/bpf/cgroup.c > +++ b/kernel/bpf/cgroup.c > @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk, > if (!sk || !sk_fullsock(sk)) > return 0; > > + /* > + * For now, socket bpf hooks attached to cgroups can only be > + * installed in the init netns and only affect the init netns. > + * This could be relaxed in the future once some semantic issues > + * are resolved. For example, ifindexes belonging to one netns > + * should probably not be visible to hooks installed by programs > + * running in a different netns. > + */ > + if (sock_net(sk) != &init_net) > + return 0; Nack. Such check will create absolutely broken abi that we won't be able to fix later. > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index e89acea22ecf..c0bbc55e244d 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr) > struct cgroup *cgrp; > enum bpf_prog_type ptype; > > + /* > + * For now, socket bpf hooks attached to cgroups can only be > + * installed in the init netns and only affect the init netns. > + * This could be relaxed in the future once some semantic issues > + * are resolved. For example, ifindexes belonging to one netns > + * should probably not be visible to hooks installed by programs > + * running in a different netns. the comment is bogus and shows complete lack of understanding how bpf is working and how it's being used. See SKF_AD_IFINDEX that was in classic bpf forever. > + */ > + if (current->nsproxy->net_ns != &init_net) > + return -EINVAL; this restriction I actually don't mind, since it indeed can be relaxed later, but please make it proper with net_eq()
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On Mon, Jan 23, 2017 at 1:03 PM, David Ahern wrote: > On 1/23/17 1:36 PM, Andy Lutomirski wrote: >> To see how cgroup+bpf interacts with network namespaces, I wrote a >> little program called show_bind that calls getsockopt(..., >> SO_BINDTODEVICE, ...) and prints the result. It did this: >> >> # ./ip link add dev vrf0 type vrf table 10 >> # ./ip vrf exec vrf0 ./show_bind >> Default binding is "vrf0" >> # ./ip vrf exec vrf0 unshare -n ./show_bind >> show_bind: getsockopt: No such device >> >> What's happening here is that "ip vrf" looks up vrf0's ifindex in >> the init netns and installs a hook that binds sockets to that > > It looks up the device name in the current namespace. > >> ifindex. When the hook runs in a different netns, it sets >> sk_bound_dev_if to an ifindex from the wrong netns, resulting in >> incorrect behavior. In this particular example, the ifindex was 4 >> and there was no ifindex 4 in the new netns. If there had been, >> this test would have malfunctioned differently > > While the cgroups and network namespace interaction needs improvement, a > management tool can workaround the deficiencies: > > A shell in the default namespace, mgmt vrf (PS1 tells me the network context): > dsa@kenny:mgmt:~$ > > Switch to a different namespace (one that I run VMs for network testing): > dsa@kenny:mgmt:~$ sudo ip netns exec vms su - dsa > > And then bind the shell to vrf2 > dsa@kenny:vms:~$ sudo ip vrf exec vrf2 su - dsa > dsa@kenny:vms:vrf2:~$ > > Or I can go straight to vrf2: > dsa@kenny:mgmt:~$ sudo ip netns exec vms ip vrf exec vrf2 su - dsa > dsa@kenny:vms:vrf2:~$ Indeed, if you're careful to set up the vrf cgroup in the same netns that you end up using it in, it'll work. But there's a bigger footgun there than I think is warranted, and I'm not sure how iproute2 is supposed to do all that much better given that the eBPF program can neither see what namespace a socket is bound to nor can it act in a way that works correctly in any namespace. Long-term, I think the real fix is to make the hook work on a per-netns basis and, if needed, add an interface for a cross-netns hook to work sensibly. But I think it's a bit late to do that for 4.10, so instead I'm proposing to limit the API to the case where it works and the semantics are unambiguous and to leave further improvements for later. It's a bit unfortunate that there seems to be an impedance mismatch in that "ip vrf" acts on cgroups and that cgroups are somewhat orthogonal to network namespaces. > > > I am testing additional iproute2 cleanups which will be sent before 4.10 is > released. > > -8<- > >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >> index e89acea22ecf..c0bbc55e244d 100644 >> --- a/kernel/bpf/syscall.c >> +++ b/kernel/bpf/syscall.c >> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr) >> struct cgroup *cgrp; >> enum bpf_prog_type ptype; >> >> + /* >> + * For now, socket bpf hooks attached to cgroups can only be >> + * installed in the init netns and only affect the init netns. >> + * This could be relaxed in the future once some semantic issues >> + * are resolved. For example, ifindexes belonging to one netns >> + * should probably not be visible to hooks installed by programs >> + * running in a different netns. >> + */ >> + if (current->nsproxy->net_ns != &init_net) >> + return -EINVAL; >> + >> if (!capable(CAP_NET_ADMIN)) >> return -EPERM; >> > > But should this patch be taken, shouldn't the EPERM out rank the namespace > check. > I could see that going either way. If the hook becomes per-netns, then the capable() check could potentially become ns_capable() and it would start succeeding. I'd be happy to change it, though. --Andy -- Andy Lutomirski AMA Capital Management, LLC
Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
On 1/23/17 1:36 PM, Andy Lutomirski wrote: > To see how cgroup+bpf interacts with network namespaces, I wrote a > little program called show_bind that calls getsockopt(..., > SO_BINDTODEVICE, ...) and prints the result. It did this: > > # ./ip link add dev vrf0 type vrf table 10 > # ./ip vrf exec vrf0 ./show_bind > Default binding is "vrf0" > # ./ip vrf exec vrf0 unshare -n ./show_bind > show_bind: getsockopt: No such device > > What's happening here is that "ip vrf" looks up vrf0's ifindex in > the init netns and installs a hook that binds sockets to that It looks up the device name in the current namespace. > ifindex. When the hook runs in a different netns, it sets > sk_bound_dev_if to an ifindex from the wrong netns, resulting in > incorrect behavior. In this particular example, the ifindex was 4 > and there was no ifindex 4 in the new netns. If there had been, > this test would have malfunctioned differently While the cgroups and network namespace interaction needs improvement, a management tool can workaround the deficiencies: A shell in the default namespace, mgmt vrf (PS1 tells me the network context): dsa@kenny:mgmt:~$ Switch to a different namespace (one that I run VMs for network testing): dsa@kenny:mgmt:~$ sudo ip netns exec vms su - dsa And then bind the shell to vrf2 dsa@kenny:vms:~$ sudo ip vrf exec vrf2 su - dsa dsa@kenny:vms:vrf2:~$ Or I can go straight to vrf2: dsa@kenny:mgmt:~$ sudo ip netns exec vms ip vrf exec vrf2 su - dsa dsa@kenny:vms:vrf2:~$ I am testing additional iproute2 cleanups which will be sent before 4.10 is released. -8<- > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index e89acea22ecf..c0bbc55e244d 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr) > struct cgroup *cgrp; > enum bpf_prog_type ptype; > > + /* > + * For now, socket bpf hooks attached to cgroups can only be > + * installed in the init netns and only affect the init netns. > + * This could be relaxed in the future once some semantic issues > + * are resolved. For example, ifindexes belonging to one netns > + * should probably not be visible to hooks installed by programs > + * running in a different netns. > + */ > + if (current->nsproxy->net_ns != &init_net) > + return -EINVAL; > + > if (!capable(CAP_NET_ADMIN)) > return -EPERM; > But should this patch be taken, shouldn't the EPERM out rank the namespace check.
[PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns
To see how cgroup+bpf interacts with network namespaces, I wrote a little program called show_bind that calls getsockopt(..., SO_BINDTODEVICE, ...) and prints the result. It did this: # ./ip link add dev vrf0 type vrf table 10 # ./ip vrf exec vrf0 ./show_bind Default binding is "vrf0" # ./ip vrf exec vrf0 unshare -n ./show_bind show_bind: getsockopt: No such device What's happening here is that "ip vrf" looks up vrf0's ifindex in the init netns and installs a hook that binds sockets to that ifindex. When the hook runs in a different netns, it sets sk_bound_dev_if to an ifindex from the wrong netns, resulting in incorrect behavior. In this particular example, the ifindex was 4 and there was no ifindex 4 in the new netns. If there had been, this test would have malfunctioned differently Since it's rather late in the release cycle, let's punt. This patch makes it impossible to install cgroup+bpf hooks outside the init netns and makes them not run on sockets that aren't in the init netns. In a future release, it should be relatively straightforward to make these hooks be local to a netns and, if needed, to add a flag so that hooks can be made global if necessary. Global hooks should presumably be constrained so that they can't write to any ifindex fields. Cc: Daniel Borkmann Cc: Alexei Starovoitov Cc: David Ahern Signed-off-by: Andy Lutomirski --- DaveM, this mitigates a bug in a feature that's new in 4.10, and the bug can be hit using current iproute2 -git. please consider this for -net. Changes from v1: - Fix the commit message. 'git commit' was very clever and thought that all the interesting bits of the test case were intended to be comments and stripped them. Whoops! kernel/bpf/cgroup.c | 21 + kernel/bpf/syscall.c | 11 +++ 2 files changed, 32 insertions(+) diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index a515f7b007c6..a824f543de69 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk, if (!sk || !sk_fullsock(sk)) return 0; + /* +* For now, socket bpf hooks attached to cgroups can only be +* installed in the init netns and only affect the init netns. +* This could be relaxed in the future once some semantic issues +* are resolved. For example, ifindexes belonging to one netns +* should probably not be visible to hooks installed by programs +* running in a different netns. +*/ + if (sock_net(sk) != &init_net) + return 0; + if (sk->sk_family != AF_INET && sk->sk_family != AF_INET6) return 0; @@ -186,6 +197,16 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk, struct bpf_prog *prog; int ret = 0; + /* +* For now, socket bpf hooks attached to cgroups can only be +* installed in the init netns and only affect the init netns. +* This could be relaxed in the future once some semantic issues +* are resolved. For example, ifindexes belonging to one netns +* should probably not be visible to hooks installed by programs +* running in a different netns. +*/ + if (sock_net(sk) != &init_net) + return 0; rcu_read_lock(); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index e89acea22ecf..c0bbc55e244d 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr) struct cgroup *cgrp; enum bpf_prog_type ptype; + /* +* For now, socket bpf hooks attached to cgroups can only be +* installed in the init netns and only affect the init netns. +* This could be relaxed in the future once some semantic issues +* are resolved. For example, ifindexes belonging to one netns +* should probably not be visible to hooks installed by programs +* running in a different netns. +*/ + if (current->nsproxy->net_ns != &init_net) + return -EINVAL; + if (!capable(CAP_NET_ADMIN)) return -EPERM; -- 2.9.3