Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level

2017-03-14 Thread Eric W. Biederman
Hannes Frederic Sowa writes:

> On 13.03.2017 23:06, Eric W. Biederman wrote:
>> Michael Kerrisk  writes:
>> 
>>> On Mon, Mar 13, 2017 at 12:44 AM, Hannes Frederic Sowa
>>>  wrote:
>>>> Hi,
>>>>
>>>> On Sun, 2017-03-12 at 16:26 -0700, David Miller wrote:
>>>>> From: Hannes Frederic Sowa
>>>>> Date: Mon, 13 Mar 2017 00:01:24 +0100
>>>>>
>>>>>> afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
>>>>>> can work with afnetns with one limitation: one cannot cross the realm
>>>>>> of a network namespace while changing the afnetns compartment. To get
>>>>>> into a new afnetns in a different net namespace, one must first change
>>>>>> to the net namespace and afterwards switch to the desired afnetns.
>>>>>
>>>>> Please explain why this is useful, who wants this kind of facility,
>>>>> and how it will be used.
>>>>
>>>> Yes, I have to enhance the cover letter:
>>>>
>>>> The work behind all this is to provide more dense container hosting.
>>>> Right now we lose performance, because all packets need to be forwarded
>>>> through either a bridge or must be routed until they reach the
>>>> containers. For example, we can't make use of early demuxing for the
>>>> incoming packets. We basically pass the networking stack twice for
>>>> every packet.
>>>>
>>>> The usage is very much in line with how network namespaces are used
>>>> nowadays:
>>>>
>>>> ip afnetns add afns-1
>>>> ip address add 192.168.1.1/24 dev eth0 afnetns afns-1
>>>> ip afnetns exec afns-1 /usr/sbin/httpd
>>>>
>>>> this spawns a shell where all child processes will only have access to
>>>> the specific ip addresses, even though they do a wildcard bind. Source
>>>> address selection will also use only the ip addresses available to the
>>>> children.
>>>>
>>>> In some sense it has lots of characteristics like ipvlan, allowing a
>>>> single MAC address to host lots of IP addresses which will end up in
>>>> different namespaces. Unlike ipvlan however, it will also solve the
>>>> problem around duplicate address detection and multiplexing packets to
>>>> the IGMP or MLD state machines.
>>>>
>>>> The resource consumption in comparison with ordinary namespaces will be
>>>> much lower. All in all, we will have far less networking subsystems to
>>>> cross compared to normal netns solutions.
>>>>
>>>> Some more information also in the first patch, which adds a
>>>> Documentation.
>> 
>> If the goal is one ip address per network namespace with a network
>> device and mac address on the network I have something that I was
>> working on that I believe is in the end a much simpler solution.
>
> Actually, it should be possible to use more than one IP address per
> namespace, proper source address selection should deal with that and
> also correctly select the higher scored ones, based on output device and
> distance to the remote ip address.

Definitely.  I should have said at least one.  Some people want address
sharing, and that precludes several kinds of optimizations.

>> Add routes in the routing table between network namespaces.
>> 
>> AKA in the initial network namespace with the network device have
>> an input route not towards the local loopback device but towards
>> the network namespaces loopback device.
>> 
>> Before other issues took precedence I made it half way to implementing
>> that.   The ip input path won't get confused if the destination network
>> device is not in the same network namespace as the device.  Last I
>> looked the ip output path still had a few places where confusion was
>> possible between the network socket and the output device.
>
> The ip afnetns input path is also of no concern to me and will work
> quite easily. Right now, the different semantics and rules for selecting
> a source address are the more problematic ones. I think, that in the
> case of directly routing from one ns into another this will be the same
> and the most complex case to deal with?

With what I am proposing that case should be drop dead simple and cause
no confusion.  The extra routes should look like ordinary routes
for forwarding packets, not local addresses and as such should cause
no confusion.  So source address selection should work perfectly as is.

>> As long as installing such routes is conditional upon having
>> CAP_NET_ADMIN in both network namespaces you should be fine and things
>> should be very simple and very fast.  Because that won't take a special
>> case through the network stack.
>> 
>> Given that performance is your primary motive I suspect this will yield
>> the fastest possible path through the network stack as no extra steps
>> need to be taken, and can benefit from any routing improvements to the
>> ordinary network stack.
>
> The major performance improvements come from socket early demuxing,
> which actually requires the remote netns socket being visible in the
> initial netns esock tables. We need the 

Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level

2017-03-14 Thread Hannes Frederic Sowa
On 13.03.2017 23:06, Eric W. Biederman wrote:
> Michael Kerrisk  writes:
> 
>> On Mon, Mar 13, 2017 at 12:44 AM, Hannes Frederic Sowa
>>  wrote:
>>> Hi,
>>>
>>> On Sun, 2017-03-12 at 16:26 -0700, David Miller wrote:
>>>> From: Hannes Frederic Sowa
>>>> Date: Mon, 13 Mar 2017 00:01:24 +0100
>>>>
>>>>> afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
>>>>> can work with afnetns with one limitation: one cannot cross the realm
>>>>> of a network namespace while changing the afnetns compartment. To get
>>>>> into a new afnetns in a different net namespace, one must first change
>>>>> to the net namespace and afterwards switch to the desired afnetns.
>>>>
>>>> Please explain why this is useful, who wants this kind of facility,
>>>> and how it will be used.
>>>
>>> Yes, I have to enhance the cover letter:
>>>
>>> The work behind all this is to provide more dense container hosting.
>>> Right now we lose performance, because all packets need to be forwarded
>>> through either a bridge or must be routed until they reach the
>>> containers. For example, we can't make use of early demuxing for the
>>> incoming packets. We basically pass the networking stack twice for
>>> every packet.
>>>
>>> The usage is very much in line with how network namespaces are used
>>> nowadays:
>>>
>>> ip afnetns add afns-1
>>> ip address add 192.168.1.1/24 dev eth0 afnetns afns-1
>>> ip afnetns exec afns-1 /usr/sbin/httpd
>>>
>>> this spawns a shell where all child processes will only have access to
>>> the specific ip addresses, even though they do a wildcard bind. Source
>>> address selection will also use only the ip addresses available to the
>>> children.
>>>
>>> In some sense it has lots of characteristics like ipvlan, allowing a
>>> single MAC address to host lots of IP addresses which will end up in
>>> different namespaces. Unlike ipvlan however, it will also solve the
>>> problem around duplicate address detection and multiplexing packets to
>>> the IGMP or MLD state machines.
>>>
>>> The resource consumption in comparison with ordinary namespaces will be
>>> much lower. All in all, we will have far less networking subsystems to
>>> cross compared to normal netns solutions.
>>>
>>> Some more information also in the first patch, which adds a
>>> Documentation.
> 
> If the goal is one ip address per network namespace with a network
> device and mac address on the network I have something that I was
> working on that I believe is in the end a much simpler solution.

Actually, it should be possible to use more than one IP address per
namespace, proper source address selection should deal with that and
also correctly select the higher scored ones, based on output device and
distance to the remote ip address.

> Add routes in the routing table between network namespaces.
> 
> AKA in the initial network namespace with the network device have
> an input route not towards the local loopback device but towards
> the network namespaces loopback device.
> 
> Before other issues took precedence I made it half way to implementing
> that.   The ip input path won't get confused if the destination network
> device is not in the same network namespace as the device.  Last I
> looked the ip output path still had a few places where confusion was
> possible between the network socket and the output device.

The ip afnetns input path is also of no concern to me and will work
quite easily. Right now, the different semantics and rules for selecting
a source address are the more problematic ones. I think, that in the
case of directly routing from one ns into another this will be the same
and the most complex case to deal with?

> As long as installing such routes is conditional upon having
> CAP_NET_ADMIN in both network namespaces you should be fine and things
> should be very simple and very fast.  Because that won't take a special
> case through the network stack.
> 
> Given that performance is your primary motive I suspect this will yield
> the fastest possible path through the network stack as no extra steps
> need to be taken, and can benefit from any routing improvements to the
> ordinary network stack.

The major performance improvements come from socket early demuxing,
which actually requires the remote netns socket being visible in the
initial netns esock tables. We need the same for the representations for
IP addresses to have ARP/NDISC work correctly. As soon as you try to
just cross one data structure from one netns to another one, it gets
really difficult to keep track of all the dependencies. It felt way more
complex than this approach.

Thanks for your comments!

Bye,
Hannes




Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level

2017-03-13 Thread Eric W. Biederman
Michael Kerrisk writes:

> On Mon, Mar 13, 2017 at 12:44 AM, Hannes Frederic Sowa
>  wrote:
>> Hi,
>>
>> On Sun, 2017-03-12 at 16:26 -0700, David Miller wrote:
>>> From: Hannes Frederic Sowa 
>>> Date: Mon, 13 Mar 2017 00:01:24 +0100
>>>
>>> > afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
>>> > can work with afnetns with one limitation: one cannot cross the realm
>>> > of a network namespace while changing the afnetns compartment. To get
>>> > into a new afnetns in a different net namespace, one must first change
>>> > to the net namespace and afterwards switch to the desired afnetns.
>>>
>>> Please explain why this is useful, who wants this kind of facility,
>>> and how it will be used.
>>
>> Yes, I have to enhance the cover letter:
>>
>> The work behind all this is to provide more dense container hosting.
>> Right now we lose performance, because all packets need to be forwarded
>> through either a bridge or must be routed until they reach the
>> containers. For example, we can't make use of early demuxing for the
>> incoming packets. We basically pass the networking stack twice for
>> every packet.
>>
>> The usage is very much in line with how network namespaces are used
>> nowadays:
>>
>> ip afnetns add afns-1
>> ip address add 192.168.1.1/24 dev eth0 afnetns afns-1
>> ip afnetns exec afns-1 /usr/sbin/httpd
>>
>> this spawns a shell where all child processes will only have access to
>> the specific ip addresses, even though they do a wildcard bind. Source
>> address selection will also use only the ip addresses available to the
>> children.
>>
>> In some sense it has lots of characteristics like ipvlan, allowing a
>> single MAC address to host lots of IP addresses which will end up in
>> different namespaces. Unlike ipvlan however, it will also solve the
>> problem around duplicate address detection and multiplexing packets to
>> the IGMP or MLD state machines.
>>
>> The resource consumption in comparison with ordinary namespaces will be
>> much lower. All in all, we will have far less networking subsystems to
>> cross compared to normal netns solutions.
>>
>> Some more information also in the first patch, which adds a
>> Documentation.

If the goal is one ip address per network namespace with a network
device and mac address on the network I have something that I was
working on that I believe is in the end a much simpler solution.

Add routes in the routing table between network namespaces.

AKA in the initial network namespace with the network device have
an input route not towards the local loopback device but towards
the network namespaces loopback device.

Before other issues took precedence I made it half way to implementing
that.   The ip input path won't get confused if the destination network
device is not in the same network namespace as the device.  Last I
looked the ip output path still had a few places where confusion was
possible between the network socket and the output device.

As long as installing such routes is conditional upon having
CAP_NET_ADMIN in both network namespaces you should be fine and things
should be very simple and very fast.  Because that won't take a special
case through the network stack.

Given that performance is your primary motive I suspect this will yield
the fastest possible path through the network stack as no extra steps
need to be taken, and can benefit from any routing improvements to the
ordinary network stack.

Eric




Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level

2017-03-13 Thread Michael Kerrisk
[CC += linux-...@vger.kernel.org]

Hannes,

Since this is a kernel-user-space API change, please CC linux-api@
(and on future iterations of the series). The kernel source file
Documentation/SubmitChecklist notes that all Linux kernel patches that
change userspace interfaces should be CCed to
linux-...@vger.kernel.org, so that the various parties who are
interested in API changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html

Thanks,

Michael


On Mon, Mar 13, 2017 at 12:44 AM, Hannes Frederic Sowa
 wrote:
> Hi,
>
> On Sun, 2017-03-12 at 16:26 -0700, David Miller wrote:
>> From: Hannes Frederic Sowa 
>> Date: Mon, 13 Mar 2017 00:01:24 +0100
>>
>> > afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
>> > can work with afnetns with one limitation: one cannot cross the realm
>> > of a network namespace while changing the afnetns compartment. To get
>> > into a new afnetns in a different net namespace, one must first change
>> > to the net namespace and afterwards switch to the desired afnetns.
>>
>> Please explain why this is useful, who wants this kind of facility,
>> and how it will be used.
>
> Yes, I have to enhance the cover letter:
>
> The work behind all this is to provide more dense container hosting.
> Right now we lose performance, because all packets need to be forwarded
> through either a bridge or must be routed until they reach the
> containers. For example, we can't make use of early demuxing for the
> incoming packets. We basically pass the networking stack twice for
> every packet.
>
> The usage is very much in line with how network namespaces are used
> nowadays:
>
> ip afnetns add afns-1
> ip address add 192.168.1.1/24 dev eth0 afnetns afns-1
> ip afnetns exec afns-1 /usr/sbin/httpd
>
> this spawns a shell where all child processes will only have access to
> the specific ip addresses, even though they do a wildcard bind. Source
> address selection will also use only the ip addresses available to the
> children.
>
> In some sense it has lots of characteristics like ipvlan, allowing a
> single MAC address to host lots of IP addresses which will end up in
> different namespaces. Unlike ipvlan however, it will also solve the
> problem around duplicate address detection and multiplexing packets to
> the IGMP or MLD state machines.
>
> The resource consumption in comparison with ordinary namespaces will be
> much lower. All in all, we will have far less networking subsystems to
> cross compared to normal netns solutions.
>
> Some more information also in the first patch, which adds a
> Documentation.
>
> Bye,
> Hannes
>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/


Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level

2017-03-12 Thread Hannes Frederic Sowa
Hi,

On Sun, 2017-03-12 at 16:26 -0700, David Miller wrote:
> From: Hannes Frederic Sowa 
> Date: Mon, 13 Mar 2017 00:01:24 +0100
> 
> > afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
> > can work with afnetns with one limitation: one cannot cross the realm
> > of a network namespace while changing the afnetns compartment. To get
> > into a new afnetns in a different net namespace, one must first change
> > to the net namespace and afterwards switch to the desired afnetns.
> 
> Please explain why this is useful, who wants this kind of facility,
> and how it will be used.

Yes, I have to enhance the cover letter:

The work behind all this is to provide more dense container hosting.
Right now we lose performance, because all packets need to be forwarded
through either a bridge or must be routed until they reach the
containers. For example, we can't make use of early demuxing for the
incoming packets. We basically pass the networking stack twice for
every packet.

The usage is very much in line with how network namespaces are used
nowadays:

ip afnetns add afns-1
ip address add 192.168.1.1/24 dev eth0 afnetns afns-1
ip afnetns exec afns-1 /usr/sbin/httpd

this spawns a shell where all child processes will only have access to
the specific ip addresses, even though they do a wildcard bind. Source
address selection will also use only the ip addresses available to the
children.

In some sense it has lots of characteristics like ipvlan, allowing a
single MAC address to host lots of IP addresses which will end up in
different namespaces. Unlike ipvlan however, it will also solve the
problem around duplicate address detection and multiplexing packets to
the IGMP or MLD state machines.

The resource consumption in comparison with ordinary namespaces will be
much lower. All in all, we will have far less networking subsystems to
cross compared to normal netns solutions.

Some more information also in the first patch, which adds a
Documentation.

Bye,
Hannes



Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level

2017-03-12 Thread David Miller
From: Hannes Frederic Sowa 
Date: Mon, 13 Mar 2017 00:01:24 +0100

> afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
> can work with afnetns with one limitation: one cannot cross the realm
> of a network namespace while changing the afnetns compartment. To get
> into a new afnetns in a different net namespace, one must first change
> to the net namespace and afterwards switch to the desired afnetns.

Please explain why this is useful, who wants this kind of facility,
and how it will be used.

Thank you.


[PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level

2017-03-12 Thread Hannes Frederic Sowa
--- >8 ---
Note:
* BE CAREFUL SOURCE ADDRESS SELECTION 
--- >8 ---

afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
can work with afnetns with one limitation: one cannot cross the realm
of a network namespace while changing the afnetns compartment. To get
into a new afnetns in a different net namespace, one must first change
to the net namespace and afterwards switch to the desired afnetns.
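
The ordering rule above can be pictured with a small toy model in
Python. This is only a sketch of the rule as stated in this letter, not
the kernel implementation: the class names, the CrossNetnsError
exception and the idea that entering a net namespace drops the task
into a default afnetns of that namespace are all assumptions made for
illustration.

```python
from dataclasses import dataclass


class CrossNetnsError(Exception):
    """Hypothetical error for an afnetns switch that would cross net namespaces."""


@dataclass(frozen=True)
class AfNetNs:
    name: str
    netns: str  # the net namespace this afnetns lives in


@dataclass
class Task:
    netns: str
    afnetns: AfNetNs


def switch_netns(task: Task, netns: str, default_afnetns: AfNetNs) -> None:
    # Assumption: entering a net namespace places the task in an afnetns
    # that belongs to that net namespace.
    if default_afnetns.netns != netns:
        raise ValueError("default afnetns must belong to the target netns")
    task.netns = netns
    task.afnetns = default_afnetns


def switch_afnetns(task: Task, target: AfNetNs) -> None:
    # Rule from the text: the target afnetns must live in the task's
    # current net namespace.
    if target.netns != task.netns:
        raise CrossNetnsError(f"{target.name} is not in netns {task.netns}")
    task.afnetns = target


host_default = AfNetNs("default", "host")
other_default = AfNetNs("default", "other")
web = AfNetNs("web", "other")
task = Task("host", host_default)

try:
    switch_afnetns(task, web)  # direct jump across net namespaces
    direct_jump_ok = True
except CrossNetnsError:
    direct_jump_ok = False

switch_netns(task, "other", other_default)  # step 1: change net namespace
switch_afnetns(task, web)                   # step 2: change afnetns
```

In this model the direct jump is rejected, while the two-step sequence
(net namespace first, afnetns second) succeeds.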

The primitive objects in the kernel that an afnetns relates to are:
- process
- socket
- ipv4 address
- ipv6 address.

An afnetns basically forms a namespace around socket binds. While not
strictly necessary, it also affects the source routing, so firewall rules
are easier to maintain. It in no way deals with the reception and
handling of multicast or broadcast sockets. As the afnetns namespaces
are connecting to the same L2 network, it does not make sense to try to
build up separation rules here, as they can be broken anyway.
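
As a rough illustration of "a namespace around socket binds", the
following Python sketch models addresses that carry an afnetns tag and
a bind check against that tag. The address table, the namespace names
and both function names are hypothetical; the real checks live in the
kernel bind path and may differ in detail.

```python
import ipaddress

# Hypothetical address table: every configured address carries an afnetns tag.
ADDRS = {
    ipaddress.ip_address("192.168.1.1"): "afns-1",
    ipaddress.ip_address("192.168.1.2"): "afns-2",
}


def bind_allowed(sock_afnetns: str, addr: str) -> bool:
    """Return True if a socket in sock_afnetns may bind to addr."""
    ip = ipaddress.ip_address(addr)
    if ip.is_unspecified:
        # A wildcard bind is accepted, but see visible_addresses() for
        # what the socket can actually use afterwards.
        return True
    return ADDRS.get(ip) == sock_afnetns


def visible_addresses(sock_afnetns: str) -> list:
    """Addresses a wildcard-bound socket in sock_afnetns would cover."""
    return sorted(str(ip) for ip, ns in ADDRS.items() if ns == sock_afnetns)
```

With this table, a socket in afns-1 may bind to 192.168.1.1 or to the
wildcard address, but not to 192.168.1.2, and a wildcard bind only
covers 192.168.1.1.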

In comparison to ipvlan, afnetns allows the use of early socket
demuxing.

Loopback is not possible within an afnetns until its own loopback device
is added or its private ip address is used.

The easiest way to use afnetns is to use the iproute2 interface, which
very much follows the style of ip-netns.

$ ip afnetns help
Usage: ip afnetns list
   ip afnetns add NAME
   ip afnetns del NAME
   ip afnetns exec NAME cmd ...

IP addresses carry an afnetns identifier, too. It is visible with the -d
(details) option:

$ ip -d a l dev lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 numtxqueues 1 numrxqueues 1
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever afnet afnet:[4026531958],self
inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever afnet afnet:[4026531958],self

This shows the afnetns inode number, as well as that we are currently in
the same namespace as the two specified ip addresses. In case we added
a name for the namespace with ip-afnetns, it will be visible here, too.
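
For scripting against this output, the afnet token can be picked apart
with a few lines of Python. This is a sketch that only handles the
inode form shown above ("afnet afnet:[INODE]" with an optional ",self"
suffix); how a named afnetns is printed is not shown here, so that case
is deliberately left out. The function name is made up for the example.

```python
import re

# Matches e.g. "afnet afnet:[4026531958],self" as printed by "ip -d a l".
AFNET_TOKEN = re.compile(r"\bafnet afnet:\[(\d+)\](,self)?")


def parse_afnet(line: str):
    """Return (inode, is_self) for an address line, or None if no token is found."""
    m = AFNET_TOKEN.search(line)
    if m is None:
        return None
    return int(m.group(1)), m.group(2) is not None


sample = "   valid_lft forever preferred_lft forever afnet afnet:[4026531958],self"
parsed = parse_afnet(sample)
```

Here parsed holds the inode number as an integer and a flag telling
whether the address is in the caller's own afnetns.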

$ ip a a 10.0.0.1/24 dev lo afnetns test

This command adds a new ip address to the loopback device and makes it
available in the test afnetns. Commands in this namespace can use this
IP address and use it for outgoing communication.
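
The effect on outgoing communication can be sketched as a filter in
front of source address selection. The Python below is a simplified
model, not the kernel algorithm: it only keeps addresses tagged with
the caller's afnetns and then prefers one whose subnet contains the
destination. The function name and the address list are assumptions.

```python
import ipaddress


def select_source(addresses, afnetns, dest):
    """Pick a source address for dest from the addresses tagged with afnetns.

    addresses is a list of ("ADDR/PREFIXLEN", afnetns_name) pairs.
    Returns the chosen address as a string, or None if nothing fits.
    """
    dst = ipaddress.ip_address(dest)
    candidates = [
        ipaddress.ip_interface(addr)
        for addr, ns in addresses
        if ns == afnetns
    ]
    # Only addresses of the same family as the destination are usable.
    candidates = [c for c in candidates if c.version == dst.version]
    if not candidates:
        return None
    # Simplification: prefer an address whose subnet contains the destination.
    on_link = [c for c in candidates if dst in c.network]
    return str((on_link or candidates)[0].ip)


addresses = [
    ("127.0.0.1/8", "default"),
    ("10.0.0.1/24", "test"),
    ("192.168.1.1/24", "test"),
]
```

A process in the test afnetns would never be handed 127.0.0.1 by this
model, because that address is tagged with a different afnetns.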

Changelog:
v1) first published version

The same commands work for IPv6, I only used IPv4 as an example.

This is still work in progress.

Hannes Frederic Sowa (27):
  afnetns: add CLONE_NEWAFNET flag
  afnetns: basic namespace operations and representations
  afnetns: prepare for integration into ipv4
  afnetns: add net_afnetns
  afnetns: ipv6 integration
  afnetns: put afnetns pointer into struct sock
  ipv4: introduce ifa_find_rcu
  afnetns: factor out inet_allow_bind
  afnetns: add sock_afnetns
  afnetns: add ifa_find_afnetns_rcu
  afnetns: validate afnetns in inet_allow_bind
  afnetns: ipv4/udp integration
  afnetns: use inet_allow_bind in inet6_bind
  afnetns: check for afnetns in inet6_bind
  afnetns: add ipv6_get_ifaddr_afnetns_rcu
  afnetns: add udpv6 support
  afnetns: introduce __inet_select_addr
  afnetns: afnetns should influence source address selection
  afnetns: add afnetns support for tcpv4
  ipv6: move ipv6_get_ifaddr to vmlinux in case ipv6 is build as module
  afnetns: add support for tcpv6
  afnetns: track owning namespace for inet_bind
  afnetns: use user_ns from afnetns for checking for binding to port < 1024
  afnetns: check afnetns user_ns in inet6_bind
  afnetns: ipv4: inherit afnetns from calling application
  afnetns: ipv6: inherit afnetns from calling application
  afnetns: allow only whitelisted protocols to operate inside afnetns

 Documentation/networking/afnetns.txt    |  64 +
 drivers/target/iscsi/cxgbit/cxgbit_cm.c |   2 +-
 fs/proc/namespaces.c                    |   3 +
 include/linux/inetdevice.h              |  22 -
 include/linux/nsproxy.h                 |   3 +
 include/linux/proc_ns.h                 |   1 +
 include/net/addrconf.h                  |  26 +-
 include/net/afnetns.h                   |  47 ++
 include/net/if_inet6.h                  |   3 +
 include/net/inet_common.h               |   1 +
 include/net/inet_sock.h                 |   1 +
 include/net/net_namespace.h             |  12 +++
 include/net/protocol.h                  |   1 +
 include/net/route.h                     |  10 +-
 include/net/sock.h                      |  13 +++
 include/uapi/linux/if_addr.h            |   2 +
 include/uapi/linux/sched.h              |   1 +
 kernel/fork.c                           |  12 ++-
 kernel/nsproxy.c                        |  24 -
 net/Kconfig                             |  10 ++
 net/core/Makefile                       |   1 +
 net/core/afnetns.c                      | 159
 net/core/net_namespace.c                |  25 +
 net/core/sock.c