Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level
Hannes Frederic Sowa writes:

> On 13.03.2017 23:06, Eric W. Biederman wrote:
>> Michael Kerrisk writes:
>>
>>> On Mon, Mar 13, 2017 at 12:44 AM, Hannes Frederic Sowa
>>> wrote:
>>>> Hi,
>>>>
>>>> On Sun, 2017-03-12 at 16:26 -0700, David Miller wrote:
>>>> > From: Hannes Frederic Sowa
>>>> > Date: Mon, 13 Mar 2017 00:01:24 +0100
>>>> >
>>>> >> afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
>>>> >> can work with afnetns with one limitation: one cannot cross the realm
>>>> >> of a network namespace while changing the afnetns compartment. To get
>>>> >> into a new afnetns in a different net namespace, one must first change
>>>> >> to the net namespace and afterwards switch to the desired afnetns.
>>>> >
>>>> > Please explain why this is useful, who wants this kind of facility,
>>>> > and how it will be used.
>>>>
>>>> Yes, I have to enhance the cover letter:
>>>>
>>>> The work behind all this is to provide more dense container hosting.
>>>> Right now we lose performance, because all packets need to be forwarded
>>>> through either a bridge or must be routed until they reach the
>>>> containers. For example, we can't make use of early demuxing for the
>>>> incoming packets. We basically pass the networking stack twice for
>>>> every packet.
>>>>
>>>> The usage is very much in line with how network namespaces are used
>>>> nowadays:
>>>>
>>>>   ip afnetns add afns-1
>>>>   ip address add 192.168.1.1/24 dev eth0 afnetns afns-1
>>>>   ip afnetns exec afns-1 /usr/sbin/httpd
>>>>
>>>> this spawns a shell where all child processes will only have access to
>>>> the specific ip addresses, even though they do a wildcard bind. Source
>>>> address selection will also use only the ip addresses available to the
>>>> children.
>>>>
>>>> In some sense it has lots of characteristics like ipvlan, allowing a
>>>> single MAC address to host lots of IP addresses which will end up in
>>>> different namespaces. Unlike ipvlan however, it will also solve the
>>>> problem around duplicate address detection and multiplexing packets to
>>>> the IGMP or MLD state machines.
>>>>
>>>> The resource consumption in comparison with ordinary namespaces will be
>>>> much lower. All in all, we will have far fewer networking subsystems to
>>>> cross compared to normal netns solutions.
>>>>
>>>> Some more information is also in the first patch, which adds
>>>> documentation.
>>
>> If the goal is one ip address per network namespace with a network
>> device and mac address on the network I have something that I was
>> working on that I believe is in the end a much simpler solution.
>
> Actually, it should be possible to use more than one IP address per
> namespace; proper source address selection should deal with that and
> also correctly select the higher scored ones, based on output device and
> distance to the remote ip address.

Definitely. I should have said at least one. Some people want address
sharing, and that precludes several kinds of optimizations.

>> Add routes in the routing table between network namespaces.
>>
>> AKA in the initial network namespace with the network device have
>> an input route not towards the local loopback device but towards
>> the network namespace's loopback device.
>>
>> Before other issues took precedence I made it half way to implementing
>> that. The ip input path won't get confused if the destination network
>> device is not in the same network namespace as the device. Last I
>> looked the ip output path still had a few places where confusion was
>> possible between the network socket and the output device.
>
> The ip afnetns input path is also of no concern to me and will work
> quite easily. Right now, the different semantics and rules for selecting
> a source address are the more problematic ones. I think that in the
> case of directly routing from one ns into another this will be the same
> and the most complex case to deal with?

With what I am proposing that case should be drop dead simple and cause
no confusion. The extra routes should look like ordinary routes for
forwarding packets, not local addresses, and as such should cause no
confusion. So source address selection should work perfectly as is.

>> As long as installing such routes is conditional upon having
>> CAP_NET_ADMIN in both network namespaces you should be fine and things
>> should be very simple and very fast, because that won't take a special
>> case through the network stack.
>>
>> Given that performance is your primary motive I suspect this will yield
>> the fastest possible path through the network stack as no extra steps
>> need to be taken, and can benefit from any routing improvements to the
>> ordinary network stack.
>
> The major performance improvements come from socket early demuxing,
> which actually requires the remote netns socket being visible in the
> initial netns esock tables. We need the
Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level
On 13.03.2017 23:06, Eric W. Biederman wrote:
> Michael Kerrisk writes:
>
>> On Mon, Mar 13, 2017 at 12:44 AM, Hannes Frederic Sowa
>> wrote:
>>> Hi,
>>>
>>> On Sun, 2017-03-12 at 16:26 -0700, David Miller wrote:
>>>> From: Hannes Frederic Sowa
>>>> Date: Mon, 13 Mar 2017 00:01:24 +0100
>>>>
>>>> > afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
>>>> > can work with afnetns with one limitation: one cannot cross the realm
>>>> > of a network namespace while changing the afnetns compartment. To get
>>>> > into a new afnetns in a different net namespace, one must first change
>>>> > to the net namespace and afterwards switch to the desired afnetns.
>>>>
>>>> Please explain why this is useful, who wants this kind of facility,
>>>> and how it will be used.
>>>
>>> Yes, I have to enhance the cover letter:
>>>
>>> The work behind all this is to provide more dense container hosting.
>>> Right now we lose performance, because all packets need to be forwarded
>>> through either a bridge or must be routed until they reach the
>>> containers. For example, we can't make use of early demuxing for the
>>> incoming packets. We basically pass the networking stack twice for
>>> every packet.
>>>
>>> The usage is very much in line with how network namespaces are used
>>> nowadays:
>>>
>>>   ip afnetns add afns-1
>>>   ip address add 192.168.1.1/24 dev eth0 afnetns afns-1
>>>   ip afnetns exec afns-1 /usr/sbin/httpd
>>>
>>> this spawns a shell where all child processes will only have access to
>>> the specific ip addresses, even though they do a wildcard bind. Source
>>> address selection will also use only the ip addresses available to the
>>> children.
>>>
>>> In some sense it has lots of characteristics like ipvlan, allowing a
>>> single MAC address to host lots of IP addresses which will end up in
>>> different namespaces. Unlike ipvlan however, it will also solve the
>>> problem around duplicate address detection and multiplexing packets to
>>> the IGMP or MLD state machines.
>>>
>>> The resource consumption in comparison with ordinary namespaces will be
>>> much lower. All in all, we will have far fewer networking subsystems to
>>> cross compared to normal netns solutions.
>>>
>>> Some more information is also in the first patch, which adds
>>> documentation.
>
> If the goal is one ip address per network namespace with a network
> device and mac address on the network I have something that I was
> working on that I believe is in the end a much simpler solution.

Actually, it should be possible to use more than one IP address per
namespace; proper source address selection should deal with that and
also correctly select the higher scored ones, based on output device and
distance to the remote ip address.

> Add routes in the routing table between network namespaces.
>
> AKA in the initial network namespace with the network device have
> an input route not towards the local loopback device but towards
> the network namespace's loopback device.
>
> Before other issues took precedence I made it half way to implementing
> that. The ip input path won't get confused if the destination network
> device is not in the same network namespace as the device. Last I
> looked the ip output path still had a few places where confusion was
> possible between the network socket and the output device.

The ip afnetns input path is also of no concern to me and will work
quite easily. Right now, the different semantics and rules for selecting
a source address are the more problematic ones. I think that in the
case of directly routing from one ns into another this will be the same
and the most complex case to deal with?

> As long as installing such routes is conditional upon having
> CAP_NET_ADMIN in both network namespaces you should be fine and things
> should be very simple and very fast, because that won't take a special
> case through the network stack.
>
> Given that performance is your primary motive I suspect this will yield
> the fastest possible path through the network stack as no extra steps
> need to be taken, and can benefit from any routing improvements to the
> ordinary network stack.

The major performance improvements come from socket early demuxing,
which actually requires the remote netns socket being visible in the
initial netns esock tables. We need the same for the representations of
IP addresses to have ARP/NDISC work correctly.

As soon as you try to just cross one data structure from one netns to
another one, it gets really difficult to keep track of all the
dependencies. It felt way more complex than this approach.

Thanks for your comments!

Bye,
Hannes
Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level
Michael Kerrisk writes:

> On Mon, Mar 13, 2017 at 12:44 AM, Hannes Frederic Sowa
> wrote:
>> Hi,
>>
>> On Sun, 2017-03-12 at 16:26 -0700, David Miller wrote:
>>> From: Hannes Frederic Sowa
>>> Date: Mon, 13 Mar 2017 00:01:24 +0100
>>>
>>> > afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
>>> > can work with afnetns with one limitation: one cannot cross the realm
>>> > of a network namespace while changing the afnetns compartment. To get
>>> > into a new afnetns in a different net namespace, one must first change
>>> > to the net namespace and afterwards switch to the desired afnetns.
>>>
>>> Please explain why this is useful, who wants this kind of facility,
>>> and how it will be used.
>>
>> Yes, I have to enhance the cover letter:
>>
>> The work behind all this is to provide more dense container hosting.
>> Right now we lose performance, because all packets need to be forwarded
>> through either a bridge or must be routed until they reach the
>> containers. For example, we can't make use of early demuxing for the
>> incoming packets. We basically pass the networking stack twice for
>> every packet.
>>
>> The usage is very much in line with how network namespaces are used
>> nowadays:
>>
>>   ip afnetns add afns-1
>>   ip address add 192.168.1.1/24 dev eth0 afnetns afns-1
>>   ip afnetns exec afns-1 /usr/sbin/httpd
>>
>> this spawns a shell where all child processes will only have access to
>> the specific ip addresses, even though they do a wildcard bind. Source
>> address selection will also use only the ip addresses available to the
>> children.
>>
>> In some sense it has lots of characteristics like ipvlan, allowing a
>> single MAC address to host lots of IP addresses which will end up in
>> different namespaces. Unlike ipvlan however, it will also solve the
>> problem around duplicate address detection and multiplexing packets to
>> the IGMP or MLD state machines.
>>
>> The resource consumption in comparison with ordinary namespaces will be
>> much lower. All in all, we will have far fewer networking subsystems to
>> cross compared to normal netns solutions.
>>
>> Some more information is also in the first patch, which adds
>> documentation.

If the goal is one ip address per network namespace with a network
device and mac address on the network I have something that I was
working on that I believe is in the end a much simpler solution.

Add routes in the routing table between network namespaces.

AKA in the initial network namespace with the network device have
an input route not towards the local loopback device but towards
the network namespace's loopback device.

Before other issues took precedence I made it half way to implementing
that. The ip input path won't get confused if the destination network
device is not in the same network namespace as the device. Last I
looked the ip output path still had a few places where confusion was
possible between the network socket and the output device.

As long as installing such routes is conditional upon having
CAP_NET_ADMIN in both network namespaces you should be fine and things
should be very simple and very fast, because that won't take a special
case through the network stack.

Given that performance is your primary motive I suspect this will yield
the fastest possible path through the network stack as no extra steps
need to be taken, and can benefit from any routing improvements to the
ordinary network stack.

Eric
Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level
[CC += linux-...@vger.kernel.org]

Hannes,

Since this is a kernel-user-space API change, please CC linux-api@
(and on future iterations of the series). The kernel source file
Documentation/SubmitChecklist notes that all Linux kernel patches that
change userspace interfaces should be CCed to
linux-...@vger.kernel.org, so that the various parties who are
interested in API changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html

Thanks,

Michael

On Mon, Mar 13, 2017 at 12:44 AM, Hannes Frederic Sowa wrote:
> Hi,
>
> On Sun, 2017-03-12 at 16:26 -0700, David Miller wrote:
>> From: Hannes Frederic Sowa
>> Date: Mon, 13 Mar 2017 00:01:24 +0100
>>
>> > afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
>> > can work with afnetns with one limitation: one cannot cross the realm
>> > of a network namespace while changing the afnetns compartment. To get
>> > into a new afnetns in a different net namespace, one must first change
>> > to the net namespace and afterwards switch to the desired afnetns.
>>
>> Please explain why this is useful, who wants this kind of facility,
>> and how it will be used.
>
> Yes, I have to enhance the cover letter:
>
> The work behind all this is to provide more dense container hosting.
> Right now we lose performance, because all packets need to be forwarded
> through either a bridge or must be routed until they reach the
> containers. For example, we can't make use of early demuxing for the
> incoming packets. We basically pass the networking stack twice for
> every packet.
>
> The usage is very much in line with how network namespaces are used
> nowadays:
>
>   ip afnetns add afns-1
>   ip address add 192.168.1.1/24 dev eth0 afnetns afns-1
>   ip afnetns exec afns-1 /usr/sbin/httpd
>
> this spawns a shell where all child processes will only have access to
> the specific ip addresses, even though they do a wildcard bind. Source
> address selection will also use only the ip addresses available to the
> children.
>
> In some sense it has lots of characteristics like ipvlan, allowing a
> single MAC address to host lots of IP addresses which will end up in
> different namespaces. Unlike ipvlan however, it will also solve the
> problem around duplicate address detection and multiplexing packets to
> the IGMP or MLD state machines.
>
> The resource consumption in comparison with ordinary namespaces will be
> much lower. All in all, we will have far fewer networking subsystems to
> cross compared to normal netns solutions.
>
> Some more information is also in the first patch, which adds
> documentation.
>
> Bye,
> Hannes
>

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level
Hi,

On Sun, 2017-03-12 at 16:26 -0700, David Miller wrote:
> From: Hannes Frederic Sowa
> Date: Mon, 13 Mar 2017 00:01:24 +0100
>
> > afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
> > can work with afnetns with one limitation: one cannot cross the realm
> > of a network namespace while changing the afnetns compartment. To get
> > into a new afnetns in a different net namespace, one must first change
> > to the net namespace and afterwards switch to the desired afnetns.
>
> Please explain why this is useful, who wants this kind of facility,
> and how it will be used.

Yes, I have to enhance the cover letter:

The work behind all this is to provide more dense container hosting.
Right now we lose performance, because all packets need to be forwarded
through either a bridge or must be routed until they reach the
containers. For example, we can't make use of early demuxing for the
incoming packets. We basically pass the networking stack twice for
every packet.

The usage is very much in line with how network namespaces are used
nowadays:

  ip afnetns add afns-1
  ip address add 192.168.1.1/24 dev eth0 afnetns afns-1
  ip afnetns exec afns-1 /usr/sbin/httpd

this spawns a shell where all child processes will only have access to
the specific ip addresses, even though they do a wildcard bind. Source
address selection will also use only the ip addresses available to the
children.

In some sense it has lots of characteristics like ipvlan, allowing a
single MAC address to host lots of IP addresses which will end up in
different namespaces. Unlike ipvlan however, it will also solve the
problem around duplicate address detection and multiplexing packets to
the IGMP or MLD state machines.

The resource consumption in comparison with ordinary namespaces will be
much lower. All in all, we will have far fewer networking subsystems to
cross compared to normal netns solutions.

Some more information is also in the first patch, which adds
documentation.

Bye,
Hannes
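[Editorial sketch] The wildcard-bind and source-address behaviour
described in the message above can be illustrated with a small toy
model. This is not the patch set's code: the address table, the
afnetns labels and every function name below are hypothetical, and the
model only captures the stated rule that a process sees the addresses
assigned to its own afnetns.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_interface


@dataclass(frozen=True)
class Address:
    iface: str    # device carrying the address, e.g. "eth0"
    cidr: str     # address with prefix length
    afnetns: str  # owning afnetns (hypothetical label)


# Hypothetical address table of one network namespace: two afnetns
# compartments share the same device.
ADDRESSES = [
    Address("eth0", "192.168.1.1/24", "afns-1"),
    Address("eth0", "192.168.1.2/24", "afns-2"),
]


def visible_addresses(afnetns):
    """Addresses a process inside `afnetns` is allowed to use."""
    return [ip_interface(a.cidr).ip for a in ADDRESSES if a.afnetns == afnetns]


def wildcard_bind_accepts(afnetns, dst):
    """A wildcard-bound socket only matches destinations owned by its afnetns."""
    return ip_address(dst) in visible_addresses(afnetns)


def candidate_sources(afnetns):
    """Source address selection is restricted to the same address set."""
    return visible_addresses(afnetns)
```

In this model a server started in afns-1 with a wildcard bind would
match traffic to 192.168.1.1 but not to 192.168.1.2, although both
addresses sit on eth0.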
Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level
From: Hannes Frederic Sowa
Date: Mon, 13 Mar 2017 00:01:24 +0100

> afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
> can work with afnetns with one limitation: one cannot cross the realm
> of a network namespace while changing the afnetns compartment. To get
> into a new afnetns in a different net namespace, one must first change
> to the net namespace and afterwards switch to the desired afnetns.

Please explain why this is useful, who wants this kind of facility,
and how it will be used.

Thank you.
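[Editorial sketch] The ordering limitation quoted above (first change
the net namespace, then the afnetns) can be written down as a toy
model. The ownership table and the function names are hypothetical and
are not kernel or iproute2 API; how the kernel reports a refused switch
is not stated in the thread, so the model only returns a boolean.

```python
# Toy model only: names and data are hypothetical.

# Which network namespace each afnetns lives in (made-up example data).
AFNETNS_OWNER = {
    "afns-1": "netns-a",
    "afns-2": "netns-b",
}


def can_switch_afnetns(current_netns, target_afnetns):
    """An afnetns switch is only allowed inside the current net namespace."""
    return AFNETNS_OWNER[target_afnetns] == current_netns


def plan_switch(current_netns, target_afnetns):
    """Return the ordered namespace changes needed to reach target_afnetns."""
    steps = []
    owner = AFNETNS_OWNER[target_afnetns]
    if owner != current_netns:
        # Crossing the net namespace boundary has to happen first.
        steps.append(("net", owner))
    steps.append(("afnet", target_afnetns))
    return steps
```

For a process in netns-a, reaching afns-2 takes two steps in this
model, the net namespace change followed by the afnetns change, while
reaching afns-1 takes only the afnetns change.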
[PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level
--- >8 ---
Note:
* BE CAREFUL SOURCE ADDRESS SELECTION
--- >8 ---

afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
can work with afnetns with one limitation: one cannot cross the realm
of a network namespace while changing the afnetns compartment. To get
into a new afnetns in a different net namespace, one must first change
to the net namespace and afterwards switch to the desired afnetns.

The primitive objects in the kernel an afnetns relates to are:
- process
- socket
- ipv4 address
- ipv6 address

An afnetns basically forms a namespace around socket binds. While not
strictly necessary, it also affects the source routing, so firewall
rules are easier to maintain. It does in no way deal with the reception
and handling of multicast or broadcast sockets. As the afnetns
namespaces are connecting to the same L2 network, it does not make
sense to try to build up separation rules here, as they can be broken
anyway. In comparison to ipvlan, afnetns allows the use of early socket
demuxing.

Loopback is not possible within an afnetns until its own loopback
device is added or its private ip address is used.

The easiest way to use afnetns is to use the iproute2 interface, which
very much follows the style of ip-netns.

  $ ip afnetns help
  Usage: ip afnetns list
         ip afnetns add NAME
         ip afnetns del NAME
         ip afnetns exec NAME cmd ...

IP addresses carry an afnetns identifier, too. It is visible with the
-d (details) option:

  $ ip -d a l dev lo
  1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 numtxqueues 1 numrxqueues 1
      inet 127.0.0.1/8 scope host lo
         valid_lft forever preferred_lft forever afnet afnet:[4026531958],self
      inet6 ::1/128 scope host
         valid_lft forever preferred_lft forever afnet afnet:[4026531958],self

This shows the afnetns inode number, as well as that we are currently
in the same namespace as the two specified ip addresses. In case we
added a name for the namespace with ip-afnetns, it will be visible
here, too.

  $ ip a a 10.0.0.1/24 dev lo afnetns test

This command adds a new ip address to the loopback device and makes it
available in the test afnetns. Commands in this namespace can use this
IP address and use it for outgoing communication.

Changelog:
v1) first published version

The same commands work for IPv6, I only used IPv4 as an example. This
is still work in progress.

Hannes Frederic Sowa (27):
  afnetns: add CLONE_NEWAFNET flag
  afnetns: basic namespace operations and representations
  afnetns: prepare for integration into ipv4
  afnetns: add net_afnetns
  afnetns: ipv6 integration
  afnetns: put afnetns pointer into struct sock
  ipv4: introduce ifa_find_rcu
  afnetns: factor out inet_allow_bind
  afnetns: add sock_afnetns
  afnetns: add ifa_find_afnetns_rcu
  afnetns: validate afnetns in inet_allow_bind
  afnetns: ipv4/udp integration
  afnetns: use inet_allow_bind in inet6_bind
  afnetns: check for afnetns in inet6_bind
  afnetns: add ipv6_get_ifaddr_afnetns_rcu
  afnetns: add udpv6 support
  afnetns: introduce __inet_select_addr
  afnetns: afnetns should influence source address selection
  afnetns: add afnetns support for tcpv4
  ipv6: move ipv6_get_ifaddr to vmlinux in case ipv6 is build as module
  afnetns: add support for tcpv6
  afnetns: track owning namespace for inet_bind
  afnetns: use user_ns from afnetns for checking for binding to port < 1024
  afnetns: check afnetns user_ns in inet6_bind
  afnetns: ipv4: inherit afnetns from calling application
  afnetns: ipv6: inherit afnetns from calling application
  afnetns: allow only whitelisted protocols to operate inside afnetns

 Documentation/networking/afnetns.txt    |  64 +
 drivers/target/iscsi/cxgbit/cxgbit_cm.c |   2 +-
 fs/proc/namespaces.c                    |   3 +
 include/linux/inetdevice.h              |  22 -
 include/linux/nsproxy.h                 |   3 +
 include/linux/proc_ns.h                 |   1 +
 include/net/addrconf.h                  |  26 +-
 include/net/afnetns.h                   |  47 ++
 include/net/if_inet6.h                  |   3 +
 include/net/inet_common.h               |   1 +
 include/net/inet_sock.h                 |   1 +
 include/net/net_namespace.h             |  12 +++
 include/net/protocol.h                  |   1 +
 include/net/route.h                     |  10 +-
 include/net/sock.h                      |  13 +++
 include/uapi/linux/if_addr.h            |   2 +
 include/uapi/linux/sched.h              |   1 +
 kernel/fork.c                           |  12 ++-
 kernel/nsproxy.c                        |  24 -
 net/Kconfig                             |  10 ++
 net/core/Makefile                       |   1 +
 net/core/afnetns.c                      | 159
 net/core/net_namespace.c                |  25 +
 net/core/sock.c