Re: [RFC] network namespaces
Andrey Savochkin wrote: Hi All, I'd like to resurrect our discussion about network namespaces. In our previous discussions it appeared that we had rather polar concepts which seemed hard to reconcile. Now I have an idea of how to look at all the discussed concepts to enable everyone's usage scenario.

Hi Andrey, I have a few questions ... sorry for asking so late ;)

1. The most straightforward concept is complete separation of namespaces, covering the device list, routing tables, netfilter tables, socket hashes, and everything else. On the input path, each packet is tagged with a namespace right from the place where it appears from a device, and is processed by each layer in the context of this namespace.

If you know the namespace the packet is coming from, why do you tag the packet instead of switching to the right namespace?

Non-root namespaces communicate with the outside world in two ways: by owning hardware devices, or by receiving packets forwarded to them by their parent namespace via a pass-through device.

Will you do proxy ARP and IP forwarding in the root namespace in order to make non-root namespaces visible to the outside world?

Regards.

-- Daniel
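To make the "tag on the input path" idea concrete, here is a minimal, self-contained C sketch. All names in it (struct net_ns, struct packet, net_device_stub, netif_receive) are invented for illustration; they are not the real kernel structures or APIs.

```c
#include <stddef.h>

/* Hypothetical namespace object -- invented, not a real kernel type. */
struct net_ns {
    struct net_ns *parent;          /* NULL for the root namespace */
    /* device list, routing tables, netfilter tables, socket hashes ... */
};

struct net_device_stub {
    struct net_ns *owner_ns;        /* namespace this device belongs to */
};

struct packet {
    struct net_ns *ns;              /* namespace tag, set once on input */
    unsigned char *data;
    size_t         len;
};

static void ip_input(struct packet *pkt)
{
    /* route lookup, socket demux, etc. would all consult pkt->ns here */
    (void)pkt;
}

/* Driver receive path: tag the packet right where it appears from the
 * device, then let every later layer run in the context of pkt->ns. */
static void netif_receive(struct packet *pkt, struct net_device_stub *dev)
{
    pkt->ns = dev->owner_ns;
    ip_input(pkt);
}
```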
Re: [RFC] network namespaces
Sorry, I didn't understand your proposal correctly from the previous talk. :) But...

On Tuesday 12 September 2006 07:28, Eric W. Biederman wrote: Do you have some concrete arguments against the proposal?

Yes, I have. I think it is an unnecessary complication. This complication will lead to additional bugs, especially if we accept rule creation in userspace. Why do we need a complex solution if there are only two approaches to socket binding, isolation and virtualization? These approaches could co-exist without hooks. Or do you have thoughts about other ways?

-- Thanks, Dmitry.
Re: [Devel] Re: [RFC] network namespaces
Dmitry Mishin wrote: On Friday 08 September 2006 22:11, Herbert Poetzl wrote: actually the light-weight ip isolation runs perfectly fine _without_ CAP_NET_ADMIN, as you do not want the guest to be able to mess with the 'configured' ips at all (not to speak of interfaces here)

It was only an example. I'm thinking about how to implement a flexible solution which permits light-weight IP isolation as well as full-fledged network virtualization. Another solution is to split CONFIG_NET_NAMESPACE. Is it good for you?

Hi Dmitry,

I am currently working on this and I am finishing a prototype bringing isolation at the IP layer. The prototype code is very close to Andrey's patches at the TCP/UDP level. So the next step is to merge the prototype code with the existing network namespace layer 2 isolation.

IMHO, the solution of splitting CONFIG_NET_NS into CONFIG_L2_NET_NS and CONFIG_L3_NET_NS is not acceptable to me because you would need to recompile the kernel. The proper way is certainly to have a specific flag for the unshare, something like CLONE_NEW_L2_NET and CLONE_NEW_L3_NET for example.

-- Daniel
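As a rough illustration of what such a flag could look like from userspace, here is a sketch; the CLONE_NEW_L2_NET and CLONE_NEW_L3_NET flags and their values are invented (no kernel defines them), only the unshare() call itself is real.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical flags: invented values illustrating the proposal of
 * picking the isolation level at unshare() time rather than at kernel
 * compile time. */
#define CLONE_NEW_L2_NET 0x10000000  /* full layer-2 network namespace */
#define CLONE_NEW_L3_NET 0x20000000  /* lightweight layer-3 (IP) isolation */

int main(void)
{
    /* A light-weight guest asks only for L3 (IP) isolation; a full
     * guest would pass CLONE_NEW_L2_NET instead. */
    if (unshare(CLONE_NEW_L3_NET) < 0) {
        perror("unshare");
        return EXIT_FAILURE;
    }
    /* From here on, this process would only see the IP addresses
     * assigned to its namespace. */
    return EXIT_SUCCESS;
}
```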
Re: [Devel] Re: [RFC] network namespaces
On Mon, Sep 11, 2006 at 04:40:59PM +0200, Daniel Lezcano wrote: Dmitry Mishin wrote: On Friday 08 September 2006 22:11, Herbert Poetzl wrote: actually the light-weight ip isolation runs perfectly fine _without_ CAP_NET_ADMIN, as you do not want the guest to be able to mess with the 'configured' ips at all (not to speak of interfaces here)

It was only an example. I'm thinking about how to implement a flexible solution which permits light-weight IP isolation as well as full-fledged network virtualization. Another solution is to split CONFIG_NET_NAMESPACE. Is it good for you?

Hi Dmitry, I am currently working on this and I am finishing a prototype bringing isolation at the IP layer. The prototype code is very close to Andrey's patches at the TCP/UDP level. So the next step is to merge the prototype code with the existing network namespace layer 2 isolation.

you might want to take a look at the current Linux-VServer implementation for the network isolation too, it should be quite similar to Andrey's approach, but maybe you can gather some additional information from there

IMHO, the solution of splitting CONFIG_NET_NS into CONFIG_L2_NET_NS and CONFIG_L3_NET_NS is not acceptable to me because you would need to recompile the kernel. The proper way is certainly to have a specific flag for the unshare, something like CLONE_NEW_L2_NET and CLONE_NEW_L3_NET for example.

I completely agree here, we need a separate namespace for that, so that we can combine isolation and virtualization as needed, unless the bind restrictions can be completely expressed with an additional mangle or filter table (as was suggested)

best, Herbert
Re: [Devel] Re: [RFC] network namespaces
Herbert Poetzl wrote: On Mon, Sep 11, 2006 at 04:40:59PM +0200, Daniel Lezcano wrote: I am currently working on this and I am finishing a prototype bringing isolation at the IP layer. The prototype code is very close to Andrey's patches at the TCP/UDP level. So the next step is to merge the prototype code with the existing network namespace layer 2 isolation.

you might want to take a look at the current Linux-VServer implementation for the network isolation too, it should be quite similar to Andrey's approach, but maybe you can gather some additional information from there

ok, thanks. I will.

IMHO, the solution of splitting CONFIG_NET_NS into CONFIG_L2_NET_NS and CONFIG_L3_NET_NS is not acceptable to me because you would need to recompile the kernel. The proper way is certainly to have a specific flag for the unshare, something like CLONE_NEW_L2_NET and CLONE_NEW_L3_NET for example.

I completely agree here, we need a separate namespace for that, so that we can combine isolation and virtualization as needed, unless the bind restrictions can be completely expressed with an additional mangle or filter table (as was suggested)

What is the bind restriction? Do you want to force binding to a specific source address?

-- Daniel
Re: [Devel] Re: [RFC] network namespaces
On Monday 11 September 2006 18:57, Herbert Poetzl wrote: I completely agree here, we need a separate namespace for that, so that we can combine isolation and virtualization as needed, unless the bind restrictions can be completely expressed with an additional mangle or filter table (as was suggested)

iptables is designed for packet flow decisions and filtering; it has nothing in common with bind restrictions. So it may only do packet flow scheduling/filtering, but it will not help to resolve bind-time IP conflicts.

-- Thanks, Dmitry.
Re: [RFC] network namespaces
Dmitry Mishin [EMAIL PROTECTED] writes: On Sunday 10 September 2006 06:47, Herbert Poetzl wrote: well, I think it would be best to have both, as they are complementary to some degree, and IMHO both the full virtualization _and_ the isolation will require a separate namespace to work, [snip] I do not think that folks would want to recompile their kernel just to get a light-weight guest or a fully virtualized one

In this case a light-weight guest will have unnecessary overhead. For example, instead of using a static pointer, we have to find the required common namespace first. And there will be no advantages for such a guest over a full-featured one.

Dmitry, that just isn't true if implemented properly.

Eric
Re: [RFC] network namespaces
Dmitry Mishin [EMAIL PROTECTED] writes: On Monday 11 September 2006 18:57, Herbert Poetzl wrote: I completely agree here, we need a separate namespace for that, so that we can combine isolation and virtualization as needed, unless the bind restrictions can be completely expressed with an additional mangle or filter table (as was suggested)

iptables is designed for packet flow decisions and filtering; it has nothing in common with bind restrictions. So it may only do packet flow scheduling/filtering, but it will not help to resolve bind-time IP conflicts.

Please read the archive, where the suggestion was made. What was suggested was a new table, with its own set of chains, so we could make filtering decisions on where sockets can be bound. That is not a far stretch from where iptables is today. Do you have some concrete arguments against the proposal?

Eric
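A sketch of the check such a bind-time filter might perform; the "bind table" and the helper below are invented for illustration and are not existing iptables or netfilter code.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented structures: a per-process-group bind table listing the local
 * addresses its sockets may bind to, filtering binds the way packets
 * are filtered. */
struct bind_rule {
    struct in_addr allowed;             /* one permitted local address */
};

struct bind_table {
    const struct bind_rule *rules;
    size_t                  nrules;
};

/* Check a bind()-time filter would run before committing the bind;
 * returning false would map to an error (e.g. -EACCES) for the caller. */
static bool bind_allowed(const struct bind_table *tbl,
                         const struct in_addr *local)
{
    size_t i;

    /* binding to INADDR_ANY collapses to "any address in the subset" */
    if (local->s_addr == htonl(INADDR_ANY))
        return tbl->nrules > 0;

    for (i = 0; i < tbl->nrules; i++)
        if (tbl->rules[i].allowed.s_addr == local->s_addr)
            return true;
    return false;
}
```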
Re: [Devel] Re: [RFC] network namespaces
On Sunday 10 September 2006 06:47, Herbert Poetzl wrote: well, I think it would be best to have both, as they are complementary to some degree, and IMHO both the full virtualization _and_ the isolation will require a separate namespace to work, [snip] I do not think that folks would want to recompile their kernel just to get a light-weight guest or a fully virtualized one

best, Herbert

In this case a light-weight guest will have unnecessary overhead. For example, instead of using a static pointer, we have to find the required common namespace first. And there will be no advantages for such a guest over a full-featured one.

-- Thanks, Dmitry.
Re: [Devel] Re: [RFC] network namespaces
On Sunday 10 September 2006 07:41, Eric W. Biederman wrote: I certainly agree that we are not at a point where a final decision can be made. A major piece of that is that a layer 2 approach has not been shown to be without a performance penalty.

But it is required. Why limit the possible usages?

A practical question. Do the IPs assigned to guests ever get used by anything besides the guest?

In the case of level-2 virtualization, no.

-- Thanks, Dmitry.
Re: [RFC] network namespaces
Dmitry Mishin [EMAIL PROTECTED] writes: On Sunday 10 September 2006 07:41, Eric W. Biederman wrote: I certainly agree that we are not at a point where a final decision can be made. A major piece of that is that a layer 2 approach has not been shown to be without a performance penalty.

But it is required. Why limit the possible usages?

Wrong perspective. The point is that we need to dig in and show that there is no measurable penalty for the current cases. Showing that there is little penalty for the advanced configurations is a plus. The practical question is: do we need to implement the grand unified lookup before we can do this cheaply, or can we implement this without needing that optimization?

To get a perspective: to get a good implementation of the pid namespace I am having to refactor significant parts of the kernel so it uses abstractions that can cleanly express what we are doing. The networking stack is in better shape, but there is a lot of it.

A practical question. Do the IPs assigned to guests ever get used by anything besides the guest?

In the case of level-2 virtualization, no.

Actually that is one of the benefits of a layer 2 implementation: you can set up weird things like shared IPs, which various types of failover scenarios want. My question was really about the layer 3 bind filtering techniques, and how people are using them.

The basic attraction with layer 3 is that you can do a simple implementation, it will run very fast, and it doesn't need to conflict with the layer 2 work at all. If you can make that layer 3 implementation clean and generally mergeable as well, it is worth pursuing.

Eric
Re: [Devel] Re: [RFC] network namespaces
On Sat, Sep 09, 2006 at 09:41:35PM -0600, Eric W. Biederman wrote: Herbert Poetzl [EMAIL PROTECTED] writes: On Sat, Sep 09, 2006 at 11:57:24AM +0400, Dmitry Mishin wrote: On Friday 08 September 2006 22:11, Herbert Poetzl wrote: actually the light-weight ip isolation runs perfectly fine _without_ CAP_NET_ADMIN, as you do not want the guest to be able to mess with the 'configured' ips at all (not to speak of interfaces here)

It was only an example. I'm thinking about how to implement a flexible solution which permits light-weight IP isolation as well as full-fledged network virtualization. Another solution is to split CONFIG_NET_NAMESPACE. Is it good for you?

well, I think it would be best to have both, as they are complementary to some degree, and IMHO both the full virtualization _and_ the isolation will require a separate namespace to work. I also think that limiting the isolation to something very simple (like one IP + network or so) would be acceptable for a start, because especially multi-IP or network range checks require a little more effort to get right ... I do not think that folks would want to recompile their kernel just to get a light-weight guest or a fully virtualized one

I certainly agree that we are not at a point where a final decision can be made. A major piece of that is that a layer 2 approach has not been shown to be without a performance penalty.

A practical question. Do the IPs assigned to guests ever get used by anything besides the guest?

only in special setups and for testing routing and general operation, of course. I.e. one typical failure scenario is this:

- 'provider' has a bunch of ips assigned
- 'host' ip works perfectly
- 'guest' ip is not routed (by the external router)

in this case, for example, I always suggest to test on the host with a guest ip; simplest example: ping -I guest-ip google.com

but for 'normal' operation, the guest ip is reserved for the guests, unless some service like named is shared between guests ...

HTH, Herbert
Re: [Devel] Re: [RFC] network namespaces
On Sun, Sep 10, 2006 at 11:45:35AM +0400, Dmitry Mishin wrote: On Sunday 10 September 2006 06:47, Herbert Poetzl wrote: well, I think it would be best to have both, as they are complementary to some degree, and IMHO both the full virtualization _and_ the isolation will require a separate namespace to work, [snip] I do not think that folks would want to recompile their kernel just to get a light-weight guest or a fully virtualized one

In this case a light-weight guest will have unnecessary overhead. For example, instead of using a static pointer, we have to find the required common namespace first.

this is only required at 'bind' time, which is a non-measurable fraction of the actual connection usage (unless you keep binding ports over and over without ever using them)

And there will be no advantages for such a guest over a full-featured one.

the advantage is in the flexibility and simplicity of setup, and the basically non-existent overhead on the hot (connection/transfer) path ...

best, Herbert
Re: [Devel] Re: [RFC] network namespaces
On Friday 08 September 2006 22:11, Herbert Poetzl wrote: actually the light-weight ip isolation runs perfectly fine _without_ CAP_NET_ADMIN, as you do not want the guest to be able to mess with the 'configured' ips at all (not to speak of interfaces here)

It was only an example. I'm thinking about how to implement a flexible solution which permits light-weight IP isolation as well as full-fledged network virtualization. Another solution is to split CONFIG_NET_NAMESPACE. Is it good for you?

-- Thanks, Dmitry.
Re: [Devel] Re: [RFC] network namespaces
On Sat, Sep 09, 2006 at 11:57:24AM +0400, Dmitry Mishin wrote: On Friday 08 September 2006 22:11, Herbert Poetzl wrote: actually the light-weight ip isolation runs perfectly fine _without_ CAP_NET_ADMIN, as you do not want the guest to be able to mess with the 'configured' ips at all (not to speak of interfaces here)

It was only an example. I'm thinking about how to implement a flexible solution which permits light-weight IP isolation as well as full-fledged network virtualization. Another solution is to split CONFIG_NET_NAMESPACE. Is it good for you?

well, I think it would be best to have both, as they are complementary to some degree, and IMHO both the full virtualization _and_ the isolation will require a separate namespace to work. I also think that limiting the isolation to something very simple (like one IP + network or so) would be acceptable for a start, because especially multi-IP or network range checks require a little more effort to get right ... I do not think that folks would want to recompile their kernel just to get a light-weight guest or a fully virtualized one

best, Herbert
Re: [Devel] Re: [RFC] network namespaces
Herbert Poetzl [EMAIL PROTECTED] writes: On Sat, Sep 09, 2006 at 11:57:24AM +0400, Dmitry Mishin wrote: On Friday 08 September 2006 22:11, Herbert Poetzl wrote: actually the light-weight ip isolation runs perfectly fine _without_ CAP_NET_ADMIN, as you do not want the guest to be able to mess with the 'configured' ips at all (not to speak of interfaces here)

It was only an example. I'm thinking about how to implement a flexible solution which permits light-weight IP isolation as well as full-fledged network virtualization. Another solution is to split CONFIG_NET_NAMESPACE. Is it good for you?

well, I think it would be best to have both, as they are complementary to some degree, and IMHO both the full virtualization _and_ the isolation will require a separate namespace to work. I also think that limiting the isolation to something very simple (like one IP + network or so) would be acceptable for a start, because especially multi-IP or network range checks require a little more effort to get right ... I do not think that folks would want to recompile their kernel just to get a light-weight guest or a fully virtualized one

I certainly agree that we are not at a point where a final decision can be made. A major piece of that is that a layer 2 approach has not been shown to be without a performance penalty.

A practical question. Do the IPs assigned to guests ever get used by anything besides the guest?

Eric
Re: [RFC] network namespaces
On Thu, Sep 07, 2006 at 12:29:21PM -0600, Eric W. Biederman wrote: Daniel Lezcano [EMAIL PROTECTED] writes: IMHO, I think there is one reason. The unsharing mechanism is not only for containers; it aims at other kinds of isolation, like a bsdjail for example. The unshare syscall is flexible; should the network unsharing be a one-block solution? For example, we want to launch an application using TCP/IP and we want to have an IP address only used by the application, nothing more. With a layer 2, after unsharing we must:

1) create a virtual device in the application namespace
2) assign an IP address
3) create a virtual device pass-through in the root namespace
4) set the virtual device IP

All this stuff needs a lot of administration (check MAC address conflicts, check interface name collisions in the root namespace, ...) for a simple network isolation.

Yes, and even more, it is hard to show that it will perform as well. Although by dropping CAP_NET_ADMIN the actual runtime administration is about the same.

With a layer 3: 1) assign an IP address

On the other hand, a layer 3 isolation is not sufficient to reach the level of isolation/virtualization needed for the system containers.

Agreed.

Very soon, I will commit more info at: http://wiki.openvz.org/Containers/Networking

So the consensus is based on the fact that there is a lot of common code for the layer 2 and layer 3 isolation/virtualization, and we can find a way to merge the two implementations in order to have a flexible network virtualization/isolation.

NACK. In a real level 3 implementation there is very little common code with a layer 2 implementation. You don't need to muck with the socket handling code, as you are not allowed to dup addresses between containers. Look at what Serge did; that is layer 3. A layer 3 isolation implementation should either be a new security module or a new form of iptables. The problem with using the LSM is that it seems to be an all-or-nothing mechanism, so it is a very coarse-grained tool for this job.

IMHO LSM was never an option for that, because it is a) very complicated to use for that purpose, b) missing many hooks you definitely need to make this work, and c) not really efficient and/or performant. With something 'like' iptables this could be done, but I'm not sure that is the best approach either ...

best, Herbert

A layer 2 implementation (where you have network devices isolated and not sockets) should be a namespace.

Eric
Re: [Devel] Re: [RFC] network namespaces
On Thursday 07 September 2006 21:27, Herbert Poetzl wrote: well, who said that you need to have things like RAW sockets or other protocols except IP, not to speak of iptable and routing entries ... folks who _want_ full network virtualization can use the more complete virtual setup and be happy ...

Let's think about how to implement this. As I understood VServer's design, your proposal is to split CAP_NET_ADMIN into multiple capabilities and use them as required. So, for your light-weight container it is enough to implement context isolation for the code protected by a CAP_NET_IP capability (for example) and put 'if (!capable(CAP_NET_*))' checks in all the other places. But this could be easily implemented over the OpenVZ code by a CAP_VE_NET_ADMIN split. So, the question is: could you point out the places in Andrey's implementation of network namespaces which prevent you from adding CAP_NET_ADMIN separation later?

-- Thanks, Dmitry.
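A minimal sketch of the capability split being discussed; CAP_NET_IP is invented here (only CAP_NET_ADMIN exists in Linux), and the helper is illustrative, not OpenVZ or VServer code.

```c
#include <stdbool.h>

/* CAP_NET_ADMIN is the real Linux capability (12 in linux/capability.h);
 * CAP_NET_IP is invented to illustrate the proposed split. */
#define CAP_NET_ADMIN 12
#define CAP_NET_IP    64

/* Stand-in for the kernel's capable(), testing the current task. */
extern bool capable(int cap);

/* Configuring an IP address: a light-weight guest could be granted only
 * the narrow CAP_NET_IP, while CAP_NET_ADMIN stays host-only. */
static int set_interface_address(void)
{
    if (!capable(CAP_NET_ADMIN) && !capable(CAP_NET_IP))
        return -1;                      /* would be -EPERM in the kernel */
    /* ... actually change the address here ... */
    return 0;
}
```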
Re: [Devel] Re: [RFC] network namespaces
On Fri, Sep 08, 2006 at 05:10:08PM +0400, Dmitry Mishin wrote: On Thursday 07 September 2006 21:27, Herbert Poetzl wrote: well, who said that you need to have things like RAW sockets or other protocols except IP, not to speak of iptable and routing entries ... folks who _want_ full network virtualization can use the more complete virtual setup and be happy ...

Let's think about how to implement this. As I understood VServer's design, your proposal is to split CAP_NET_ADMIN into multiple capabilities and use them as required. So, for your light-weight container it is enough to implement context isolation for the code protected by a CAP_NET_IP capability (for example) and put 'if (!capable(CAP_NET_*))' checks in all the other places.

actually the light-weight ip isolation runs perfectly fine _without_ CAP_NET_ADMIN, as you do not want the guest to be able to mess with the 'configured' ips at all (not to speak of interfaces here)

best, Herbert

But this could be easily implemented over the OpenVZ code by a CAP_VE_NET_ADMIN split. So, the question is: could you point out the places in Andrey's implementation of network namespaces which prevent you from adding CAP_NET_ADMIN separation later?

-- Thanks, Dmitry.
Re: [RFC] network namespaces
Caitlin Bestler wrote: [EMAIL PROTECTED] wrote: Finally, as I understand it, both network isolation and network virtualization (both level 2 and level 3) can happily co-exist. We do have several filesystems in the kernel. Let's have several network virtualization approaches, and let the user choose. Does that make sense?

If there are no compelling arguments for using both, it is silly to merge both ways of doing it, as it is more maintenance overhead.

My reading is that full virtualization (Xen, etc.) calls for implementing L2 switching between the partitions and the physical NIC(s). The tradeoffs between L2 and L3 switching are indeed complex, but there are two implications of doing L2 switching between partitions:

1) Do we really want to ask device drivers to support L2 switching for partitions and something *different* for containers?
2) Do we really want any single packet to traverse an L2 switch (for the partition-style virtualization layer) and then an L3 switch (for the container-style layer)?

The full virtualization solution calls for virtual NICs with distinct MAC addresses. Is there any reason why this same solution cannot work for containers (just creating more than one VNIC for the partition, and then assigning each VNIC to a container)?

IMHO, I think there is one reason. The unsharing mechanism is not only for containers; it aims at other kinds of isolation, like a bsdjail for example. The unshare syscall is flexible; should the network unsharing be a one-block solution? For example, we want to launch an application using TCP/IP and we want to have an IP address only used by the application, nothing more. With a layer 2, after unsharing we must:

1) create a virtual device in the application namespace
2) assign an IP address
3) create a virtual device pass-through in the root namespace
4) set the virtual device IP

All this stuff needs a lot of administration (check MAC address conflicts, check interface name collisions in the root namespace, ...) for a simple network isolation.

With a layer 3: 1) assign an IP address

On the other hand, a layer 3 isolation is not sufficient to reach the level of isolation/virtualization needed for the system containers.

Very soon, I will commit more info at: http://wiki.openvz.org/Containers/Networking

So the consensus is based on the fact that there is a lot of common code for the layer 2 and layer 3 isolation/virtualization, and we can find a way to merge the two implementations in order to have a flexible network virtualization/isolation.

-- Regards Daniel.
Re: [Devel] Re: [RFC] network namespaces
Herbert Poetzl wrote: my point (until we have an implementation which clearly shows that performance is equal/better to isolation) is simply this: of course, you can 'simulate' or 'construct' all the isolation scenarios with kernel bridging and routing and tricky injection/marking of packets, but this usually comes with an overhead ...

Well, TANSTAAFL*, and pretty much everything comes with an overhead. Multitasking comes with the (scheduler, context switch, CPU cache, etc.) overhead -- is that a reason to abandon it? OpenVZ and Linux-VServer resource management also add some overhead -- do we want to throw them away? The question is not just equal or better performance; the question is what we get and how much we pay for it.

Equal or better performance is certainly required when we have the code compiled in but aren't using it. We must not penalize the current code.

you talk about host system performance. Both approaches do not introduce overhead to host networking.

Finally, as I understand it, both network isolation and network virtualization (both level 2 and level 3) can happily co-exist. We do have several filesystems in the kernel. Let's have several network virtualization approaches, and let the user choose. Does that make sense?

If there are no compelling arguments for using both, it is silly to merge both ways of doing it, as it is more maintenance overhead.

That said, I think there is a real chance if we can look at the bind filtering and find a way to express that in the networking stack through iptables. Using the security hooks conflicts with things like SELinux. Although it would be interesting to see if SELinux can already implement general-purpose layer 3 filtering. The more I look, the gut feeling I have is that the way to proceed would be to add a new table that filters binds and connects, plus a new module that would look at a process creating a socket and tell us if it is in the appropriate group of processes. With a little care that would be a general solution to the layer 3 filtering problem.

Huh, you will still have to insert lots of access checks into different parts of the code: RAW sockets, netlink, protocols which do not yet have such checks, netfilter (to not allow creating iptables rules :)), and many, many other places. I see Dave Miller looking at such a patch and my ears hear his rude words :)

Thanks, Kirill
Re: [Devel] Re: [RFC] network namespaces
On Thu, Sep 07, 2006 at 08:23:53PM +0400, Kirill Korotaev wrote: Herbert Poetzl wrote: my point (until we have an implementation which clearly shows that performance is equal/better to isolation) is simply this: of course, you can 'simulate' or 'construct' all the isolation scenarios with kernel bridging and routing and tricky injection/marking of packets, but this usually comes with an overhead ...

Well, TANSTAAFL*, and pretty much everything comes with an overhead. Multitasking comes with the (scheduler, context switch, CPU cache, etc.) overhead -- is that a reason to abandon it? OpenVZ and Linux-VServer resource management also add some overhead -- do we want to throw them away? The question is not just equal or better performance; the question is what we get and how much we pay for it.

Equal or better performance is certainly required when we have the code compiled in but aren't using it. We must not penalize the current code.

you talk about host system performance. Both approaches do not introduce overhead to host networking.

Finally, as I understand it, both network isolation and network virtualization (both level 2 and level 3) can happily co-exist. We do have several filesystems in the kernel. Let's have several network virtualization approaches, and let the user choose. Does that make sense?

If there are no compelling arguments for using both, it is silly to merge both ways of doing it, as it is more maintenance overhead.

That said, I think there is a real chance if we can look at the bind filtering and find a way to express that in the networking stack through iptables. Using the security hooks conflicts with things like SELinux. Although it would be interesting to see if SELinux can already implement general-purpose layer 3 filtering. The more I look, the gut feeling I have is that the way to proceed would be to add a new table that filters binds and connects, plus a new module that would look at a process creating a socket and tell us if it is in the appropriate group of processes. With a little care that would be a general solution to the layer 3 filtering problem.

Huh, you will still have to insert lots of access checks into different parts of the code: RAW sockets, netlink, protocols which do not yet have such checks, netfilter (to not allow creating iptables rules :)), and many, many other places.

well, who said that you need to have things like RAW sockets or other protocols except IP, not to speak of iptable and routing entries ... folks who _want_ full network virtualization can use the more complete virtual setup and be happy ...

best, Herbert

I see Dave Miller looking at such a patch and my ears hear his rude words :)

Thanks, Kirill
Re: [RFC] network namespaces
Daniel Lezcano [EMAIL PROTECTED] writes: IMHO, I think there is one reason. The unsharing mechanism is not only for containers; it aims at other kinds of isolation, like a bsdjail for example. The unshare syscall is flexible; should the network unsharing be a one-block solution? For example, we want to launch an application using TCP/IP and we want to have an IP address only used by the application, nothing more. With a layer 2, after unsharing we must:

1) create a virtual device in the application namespace
2) assign an IP address
3) create a virtual device pass-through in the root namespace
4) set the virtual device IP

All this stuff needs a lot of administration (check MAC address conflicts, check interface name collisions in the root namespace, ...) for a simple network isolation.

Yes, and even more, it is hard to show that it will perform as well. Although by dropping CAP_NET_ADMIN the actual runtime administration is about the same.

With a layer 3: 1) assign an IP address

On the other hand, a layer 3 isolation is not sufficient to reach the level of isolation/virtualization needed for the system containers.

Agreed.

Very soon, I will commit more info at: http://wiki.openvz.org/Containers/Networking

So the consensus is based on the fact that there is a lot of common code for the layer 2 and layer 3 isolation/virtualization, and we can find a way to merge the two implementations in order to have a flexible network virtualization/isolation.

NACK. In a real level 3 implementation there is very little common code with a layer 2 implementation. You don't need to muck with the socket handling code, as you are not allowed to dup addresses between containers. Look at what Serge did; that is layer 3. A layer 3 isolation implementation should either be a new security module or a new form of iptables. The problem with using the LSM is that it seems to be an all-or-nothing mechanism, so it is a very coarse-grained tool for this job. A layer 2 implementation (where you have network devices isolated and not sockets) should be a namespace.

Eric
Re: [Devel] Re: [RFC] network namespaces
Herbert Poetzl [EMAIL PROTECTED] writes: On Thu, Sep 07, 2006 at 08:23:53PM +0400, Kirill Korotaev wrote: well, who said that you need to have things like RAW sockets or other protocols except IP, not to speak of iptable and routing entries ... folks who _want_ full network virtualization can use the more complete virtual setup and be happy ...

Exactly. This was a proposal for isolation for containers that don't get CAP_NET_ADMIN, with a facility that could easily be general purpose.

Eric
Re: [RFC] network namespaces
Hi Herbert,

well, the 'ip subset' approach Linux-VServer and other jail solutions use is very clean; it just does not match your expectations of a virtual interface (as there is none), and it does not cope well with all kinds of per-context 'requirements', which IMHO do not really exist on the application layer (only on the whole-system layer)

IMHO that would be quite simple: have a 'namespace' for limiting port binds to a subset of the available ips and another one which does complete network virtualization with all the whistles and bells. IMHO most of them are orthogonal and can easily be combined:

- full network virtualization
- lightweight ip subset
- both

IMHO this requirement only arises from the full system virtualization approach; just look at the other jail solutions (solaris, bsd, ...), some of them do not even allow for more than a single ip, but they work quite well when used properly ...

As far as I see, VServer uses a layer 3 solution but, when needed, the veth component, made by Nestor Pena, is used to provide layer 2 virtualization. Right?

Having the two solutions, you certainly have a lot of information about use cases. From the point of view of VServer, can you give some examples of when a layer 3 solution is better/worse than a layer 2 solution? Who wants a layer 2/3 virtualization and why? This information will be very useful.

Regards

-- Daniel
Re: [RFC] network namespaces
On Tue, Sep 05, 2006 at 08:45:39AM -0600, Eric W. Biederman wrote: Daniel Lezcano [EMAIL PROTECTED] writes: For HPC, if you are interested in migration you need a separate IP per container. If you can take your IP address with you, migration of networking state is simple. If you can't take your IP address with you, a network container is nearly pointless from a migration perspective. Beyond that, from everything I have seen, layer 2 is just much cleaner than any layer 3 approach short of Serge's bind filtering.

well, the 'ip subset' approach Linux-VServer and other jail solutions use is very clean; it just does not match your expectations of a virtual interface (as there is none), and it does not cope well with all kinds of per-context 'requirements', which IMHO do not really exist on the application layer (only on the whole-system layer)

I probably expressed that wrong. There are currently three basic approaches under discussion:

Layer 3 (basically bind filtering): nothing at the packet level. The approach taken by Serge's version of bsdjails and by VServer.

Layer 2.5: what Daniel proposed.

Layer 2 (trivially mapping each packet to a different interface), and then treating everything as multiple instances of the network stack. Roughly what OpenVZ and I have implemented.

I think classifying network virtualization by "Layer X" is not good enough. OpenVZ has layer 3 (venet) and layer 2 (veth) implementations, but in both cases the networking stack inside a VE remains fully virtualized.

Thanks, Kirill
Re: [Devel] Re: [RFC] network namespaces
Kirill Korotaev wrote: I think classifying network virtualization by "Layer X" is not good enough. OpenVZ has layer 3 (venet) and layer 2 (veth) implementations, but in both cases the networking stack inside a VE remains fully virtualized.

Let's describe all those (three?) approaches at http://wiki.openvz.org/Containers/Networking

Everyone is able to read and contribute to it, and (I hope) we will come to a common understanding. I have started the article, please expand it.
Re: [RFC] network namespaces
On Wed, Sep 06, 2006 at 11:10:23AM +0200, Daniel Lezcano wrote: Hi Herbert,

well, the 'ip subset' approach Linux-VServer and other jail solutions use is very clean; it just does not match your expectations of a virtual interface (as there is none), and it does not cope well with all kinds of per-context 'requirements', which IMHO do not really exist on the application layer (only on the whole-system layer)

IMHO that would be quite simple: have a 'namespace' for limiting port binds to a subset of the available ips and another one which does complete network virtualization with all the whistles and bells. IMHO most of them are orthogonal and can easily be combined:

- full network virtualization
- lightweight ip subset
- both

IMHO this requirement only arises from the full system virtualization approach; just look at the other jail solutions (solaris, bsd, ...), some of them do not even allow for more than a single ip, but they work quite well when used properly ...

As far as I see, VServer uses a layer 3 solution but, when needed, the veth component, made by Nestor Pena, is used to provide layer 2 virtualization. Right?

well, no, we do not explicitly use the VETH daemon for networking, although some folks probably make use of it. mainly because, if you realize that this kind of isolation is something different from and partially complementary to network virtualization, you can live without the layer 2 virtualization in almost all cases. nevertheless, for certain purposes layer 2/3 virtualization is required and/or makes perfect sense

Having the two solutions, you certainly have a lot of information about use cases. From the point of view of VServer, can you give some examples of when a layer 3 solution is better/worse than a layer 2 solution?

my point (until we have an implementation which clearly shows that performance is equal/better to isolation) is simply this: of course, you can 'simulate' or 'construct' all the isolation scenarios with kernel bridging and routing and tricky injection/marking of packets, but this usually comes with an overhead ...

Who wants a layer 2/3 virtualization and why?

there are some reasons for virtualization instead of pure isolation (as Linux-VServer does it for now):

- context migration/snapshot (probably reason #1)
- creating network devices inside a guest (can help with vpn and similar)
- allowing non-IP protocols (like DHCP, ICMP, etc)

the problem which arises with this kind of network virtualization is that you need some additional policy, for example to avoid sending 'evil' packets and/or (D)DoSing one guest from another, which again adds further overhead. so basically, if you 'just' want to have network isolation, you have to do this:

- create a 'copy' of your host's networking inside the guest (with virtual interfaces)
- assign all the same (subset) ips to the virtual guest interfaces
- activate some smart bridging code which 'knows' what ports can be used and/or mapped
- add policy to block unwanted connections and/or packets to/from the guest

all this sounds very intrusive and for sure (please prove me wrong here :) adds a lot of overhead to the networking itself, while a 'simple' isolation approach for IP (tcp/udp) is (almost) without any cost, certainly without overhead once a connection is established.

This information will be very useful.

HTH, Herbert
Re: [Devel] Re: [RFC] network namespaces
Herbert Poetzl wrote: my point (until we have an implementation which clearly shows that performance is equal/better to isolation) is simply this: of course, you can 'simulate' or 'construct' all the isolation scenarios with kernel bridging and routing and tricky injection/marking of packets, but this usually comes with an overhead ...

Well, TANSTAAFL*, and pretty much everything comes with an overhead. Multitasking comes with the (scheduler, context switch, CPU cache, etc.) overhead -- is that a reason to abandon it? OpenVZ and Linux-VServer resource management also add some overhead -- do we want to throw them away? The question is not just equal or better performance; the question is what we get and how much we pay for it.

Finally, as I understand it, both network isolation and network virtualization (both level 2 and level 3) can happily co-exist. We do have several filesystems in the kernel. Let's have several network virtualization approaches, and let the user choose. Does that make sense?

* -- http://en.wikipedia.org/wiki/TANSTAAFL
Re: [RFC] network namespaces
Herbert Poetzl [EMAIL PROTECTED] writes: On Wed, Sep 06, 2006 at 11:10:23AM +0200, Daniel Lezcano wrote: As far as I see, VServer uses a layer 3 solution but, when needed, the veth component, made by Nestor Pena, is used to provide layer 2 virtualization. Right?

well, no, we do not explicitly use the VETH daemon for networking, although some folks probably make use of it. mainly because, if you realize that this kind of isolation is something different from and partially complementary to network virtualization, you can live without the layer 2 virtualization in almost all cases. nevertheless, for certain purposes layer 2/3 virtualization is required and/or makes perfect sense

Having the two solutions, you certainly have a lot of information about use cases. From the point of view of VServer, can you give some examples of when a layer 3 solution is better/worse than a layer 2 solution?

my point (until we have an implementation which clearly shows that performance is equal/better to isolation) is simply this: of course, you can 'simulate' or 'construct' all the isolation scenarios with kernel bridging and routing and tricky injection/marking of packets, but this usually comes with an overhead ...

Who wants a layer 2/3 virtualization and why?

there are some reasons for virtualization instead of pure isolation (as Linux-VServer does it for now):

- context migration/snapshot (probably reason #1)
- creating network devices inside a guest (can help with vpn and similar)
- allowing non-IP protocols (like DHCP, ICMP, etc)

the problem which arises with this kind of network virtualization is that you need some additional policy, for example to avoid sending 'evil' packets and/or (D)DoSing one guest from another, which again adds further overhead. so basically, if you 'just' want to have network isolation, you have to do this:

- create a 'copy' of your host's networking inside the guest (with virtual interfaces)
- assign all the same (subset) ips to the virtual guest interfaces
- activate some smart bridging code which 'knows' what ports can be used and/or mapped
- add policy to block unwanted connections and/or packets to/from the guest

all this sounds very intrusive and for sure (please prove me wrong here :) adds a lot of overhead to the networking itself, while a 'simple' isolation approach for IP (tcp/udp) is (almost) without any cost, certainly without overhead once a connection is established.

Thanks for the good summary of the situation. I think we can prove you wrong, but it is going to take some doing to build a good implementation and take the necessary measurements.

Hmm. I wonder if the filtering layer 3 style of isolation can be built with netfilter rules. Just skimming, it looks like we may be able to do it with something like the netfilter owner module, possibly in conjunction with the connmark or conntrack modules. If not, if the infrastructure is close enough, we can write our own module. Has anyone looked at network isolation from the netfilter perspective?

Eric
Re: [RFC] network namespaces
Kir Kolyshkin [EMAIL PROTECTED] writes: Herbert Poetzl wrote: my point (until we have an implementation which clearly shows that performance is equal/better to isolation) is simply this: of course, you can 'simulate' or 'construct' all the isolation scenarios with kernel bridging and routing and tricky injection/marking of packets, but this usually comes with an overhead ...

Well, TANSTAAFL*, and pretty much everything comes with an overhead. Multitasking comes with the (scheduler, context switch, CPU cache, etc.) overhead -- is that a reason to abandon it? OpenVZ and Linux-VServer resource management also add some overhead -- do we want to throw them away? The question is not just equal or better performance; the question is what we get and how much we pay for it.

Equal or better performance is certainly required when we have the code compiled in but aren't using it. We must not penalize the current code.

Finally, as I understand it, both network isolation and network virtualization (both level 2 and level 3) can happily co-exist. We do have several filesystems in the kernel. Let's have several network virtualization approaches, and let the user choose. Does that make sense?

If there are no compelling arguments for using both, it is silly to merge both ways of doing it, as it is more maintenance overhead.

That said, I think there is a real chance if we can look at the bind filtering and find a way to express that in the networking stack through iptables. Using the security hooks conflicts with things like SELinux. Although it would be interesting to see if SELinux can already implement general-purpose layer 3 filtering. The more I look, the gut feeling I have is that the way to proceed would be to add a new table that filters binds and connects, plus a new module that would look at a process creating a socket and tell us if it is in the appropriate group of processes. With a little care that would be a general solution to the layer 3 filtering problem.

Eric
Re: [RFC] network namespaces
Eric W. Biederman wrote: Kir Kolyshkin [EMAIL PROTECTED] writes: Herbert Poetzl wrote: my point (until we have an implementation which clearly shows that performance is equal/better to isolation) is simply this: of course, you can 'simulate' or 'construct' all the isolation scenarios with kernel bridging and routing and tricky injection/marking of packets, but this usually comes with an overhead ...

Well, TANSTAAFL*, and pretty much everything comes with an overhead. Multitasking comes with the (scheduler, context switch, CPU cache, etc.) overhead -- is that a reason to abandon it? OpenVZ and Linux-VServer resource management also add some overhead -- do we want to throw them away? The question is not just equal or better performance; the question is what we get and how much we pay for it.

Equal or better performance is certainly required when we have the code compiled in but aren't using it. We must not penalize the current code.

That's a valid argument. Although it's not applicable here (at least for both network virtualization types which OpenVZ offers). Kirill/Andrey, please correct me if I'm wrong here.

Finally, as I understand it, both network isolation and network virtualization (both level 2 and level 3) can happily co-exist. We do have several filesystems in the kernel. Let's have several network virtualization approaches, and let the user choose. Does that make sense?

If there are no compelling arguments for using both, it is silly to merge both ways of doing it, as it is more maintenance overhead.

Definitely a valid argument as well. I am not sure about network isolation (used by Linux-VServer), but as it comes to level-2 vs. level-3 virtualization, I see a need for both. Here is an easy-to-understand comparison which can shed some light: http://wiki.openvz.org/Differences_between_venet_and_veth

Here are a couple of examples:

* Do we want to let the container's owner (i.e. root) add/remove IP addresses? Most probably not, but in some cases we want that.
* Do we want to be able to run a DHCP server and/or DHCP client inside a container? Sometimes ... but not always.
* Do we want to let the container's owner create/manage his own set of iptables? In half of the cases we do.

The problem here is that a single solution will not cover all those scenarios.

That said, I think there is a real chance if we can look at the bind filtering and find a way to express that in the networking stack through iptables. Using the security hooks conflicts with things like SELinux. Although it would be interesting to see if SELinux can already implement general-purpose layer 3 filtering. The more I look, the gut feeling I have is that the way to proceed would be to add a new table that filters binds and connects, plus a new module that would look at a process creating a socket and tell us if it is in the appropriate group of processes. With a little care that would be a general solution to the layer 3 filtering problem.

Eric
Re: [RFC] network namespaces
Eric W. Biederman wrote: This family of containers is used too for HPC (high performance computing) and for distributed checkpoint/restart. The cluster runs hundreds of jobs, spawning them on different hosts inside an application container. Usually the jobs communicate with broadcast and multicast. Application containers do not care about having different MAC addresses and rely on a layer 3 approach.

Ok, I think to understand this we need some precise definitions. In the normal case it is an error for a job to communicate with a different job.

hmm? What about an MPI application? I would expect each MPI task to be run in its container, on different nodes or on the same node. These individual tasks _communicate_ with each other through the MPI layer (not only TCP, btw) to complete a large calculation.

The basic advantage with a different MAC is that you can find out who the intended recipient is sooner in the networking stack, and you have truly separate network devices, allowing for a cleaner implementation. Changing the MAC after migration is likely to be fine.

indeed.

C.
Re: [RFC] network namespaces
Cedric Le Goater [EMAIL PROTECTED] writes: Eric W. Biederman wrote: hmm? What about an MPI application? I would expect each MPI task to be run in its container, on different nodes or on the same node. These individual tasks _communicate_ with each other through the MPI layer (not only TCP, btw) to complete a large calculation.

All parts of the MPI application are part of the same job. Communication between processes on multiple machines that are part of the job is fine. At least that is how I use "job" in the HPC context.

Eric
Re: [RFC] network namespaces
Kir Kolyshkin wrote: [snip] I am not sure about network isolation (used by Linux-VServer), but as it comes to level-2 vs. level-3 virtualization, I see a need for both. Here is an easy-to-understand comparison which can shed some light: http://wiki.openvz.org/Differences_between_venet_and_veth

thanks kir,

Here are a couple of examples:

* Do we want to let the container's owner (i.e. root) add/remove IP addresses? Most probably not, but in some cases we want that.
* Do we want to be able to run a DHCP server and/or DHCP client inside a container? Sometimes ... but not always.
* Do we want to let the container's owner create/manage his own set of iptables? In half of the cases we do.

The problem here is that a single solution will not cover all those scenarios.

some would argue that there is one single solution: Xen or similar. IMO, containers should try to leverage what makes them different, performance, and not try to simulate a real hardware environment. Restricting the network environment of a container should be considered acceptable if it is for the sake of performance. The network interface(s) could be pre-configured and provided to the container. Protocol(s) could be forbidden. Now, if you need more network power in a container, you will need a real or a virtualized interface. But let's consider both alternatives.

thanks,

C.
Re: [Devel] Re: [RFC] network namespaces
Kir Kolyshkin wrote: Herbert Poetzl wrote: my point (until we have an implementation which clearly shows that performance is equal/better to isolation) is simply this: of course, you can 'simulate' or 'construct' all the isolation scenarios with kernel bridging and routing and tricky injection/marking of packets, but this usually comes with an overhead ... Well, TANSTAAFL* (*there ain't no such thing as a free lunch), and pretty much everything comes with an overhead. Multitasking comes with the (scheduler, context switch, CPU cache, etc.) overhead -- is that a reason to abandon it? OpenVZ and Linux-VServer resource management also adds some overhead -- do we want to throw it away? The question is not just equal or better performance; the question is what we get and how much we pay for it. Finally, as I understand it, network isolation and network virtualization (both level 2 and level 3) can happily co-exist. We do have several filesystems in the kernel. Let's have several network virtualization approaches, and let the user choose. Does that make sense? Definitely yes, I agree.
RE: [RFC] network namespaces
[EMAIL PROTECTED] wrote: Finally, as I understand it, network isolation and network virtualization (both level 2 and level 3) can happily co-exist. We do have several filesystems in the kernel. Let's have several network virtualization approaches, and let the user choose. Does that make sense? If there are no compelling arguments for both ways of doing it, it is silly to merge both, as it means more maintenance overhead. My reading is that full virtualization (Xen, etc.) calls for implementing L2 switching between the partitions and the physical NIC(s). The tradeoffs between L2 and L3 switching are indeed complex, but there are two implications of doing L2 switching between partitions: 1) Do we really want to ask device drivers to support L2 switching for partitions and something *different* for containers? 2) Do we really want any single packet to traverse an L2 switch (for the partition-style virtualization layer) and then an L3 switch (for the container-style layer)? The full virtualization solution calls for virtual NICs with distinct MAC addresses. Is there any reason why this same solution cannot work for containers (just creating more than one VNIC for the partition, and then assigning each VNIC to a container)?
Re: [RFC] network namespaces
Caitlin Bestler [EMAIL PROTECTED] writes: snip The tradeoffs between L2 and L3 switching are indeed complex, but there are two implications of doing L2 switching between partitions: 1) Do we really want to ask device drivers to support L2 switching for partitions and something *different* for containers? No. 2) Do we really want any single packet to traverse an L2 switch (for the partition-style virtualization layer) and then an L3 switch (for the container-style layer)? In general what has been done with layer 3 is to simply filter which processes can use which IP addresses, and it all happens at socket creation time. So it is very cheap, and it can be done purely in the network layer without any driver intervention. Basically, think of what is happening at layer 3 as an extremely light-weight version of traffic filtering. The full virtualization solution calls for virtual NICs with distinct MAC addresses. Is there any reason why this same solution cannot work for containers (just creating more than one VNIC for the partition, and then assigning each VNIC to a container)? The VNIC approach is the fundamental idea of the layer two networking, and if we can push the work down into the device driver, so that different destination MACs show up in different packet queues, it should be as fast as a normal networking stack. Implementing VNICs is so far the only piece of containers that has come close to device drivers, and we can likely do it without device driver support (but with more cost). Basically this optimization is a subset of the Grand Unified Lookup idea. I think we can do a mergeable implementation with no noticeable cost when not using containers, without having to resort to a grand unified lookup, but I may be wrong. Eric
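[To make the VNIC idea above concrete, here is a minimal sketch of what the software fallback for the demux could look like. This is not from any posted patch; the vnic structure, vnic_table and vnic_rx() are invented names for illustration, and a real implementation would live in or just below the driver's receive path.]

#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/if_ether.h>

/* One entry per container-owned MAC address (hypothetical). */
struct vnic {
	unsigned char mac[ETH_ALEN];	/* MAC assigned to the container */
	struct net_device *dev;		/* virtual device inside the namespace */
	struct vnic *next;
};

static struct vnic *vnic_table;		/* a real version needs locking */

/*
 * Called early on the physical device's receive path: steer the
 * frame to the virtual device owning the destination MAC, so the
 * rest of the stack runs in that namespace's context.
 */
static int vnic_rx(struct sk_buff *skb)
{
	struct ethhdr *eth = eth_hdr(skb);
	struct vnic *v;

	for (v = vnic_table; v; v = v->next) {
		if (!compare_ether_addr(eth->h_dest, v->mac)) {
			skb->dev = v->dev;	/* retarget at the virtual device */
			return netif_rx(skb);	/* requeue in the new context */
		}
	}
	return netif_rx(skb);			/* unknown MAC: host stack */
}

[When the demux can be pushed into per-MAC hardware queues, the loop disappears entirely, which is what makes the "as fast as a normal networking stack" claim plausible.]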
Re: [RFC] network namespaces
On Wed, 06 Sep 2006 17:25:50 -0600 [EMAIL PROTECTED] (Eric W. Biederman) wrote: snip The VNIC approach is the fundamental idea of the layer two networking, and if we can push the work down into the device driver, so that different destination MACs show up in different packet queues, it should be as fast as a normal networking stack. Implementing VNICs is so far the only piece of containers that has come close to device drivers, and we can likely do it without device driver support (but with more cost). The problem with VNICs is that they won't work for all devices (without lots of work), and for many devices they require putting the device in promiscuous mode. They also play havoc with network access control devices.
Re: [RFC] network namespaces
Stephen Hemminger [EMAIL PROTECTED] writes: The problem with VNICs is that they won't work for all devices (without lots of work), and for many devices they require putting the device in promiscuous mode. They also play havoc with network access control devices. Which is fine. If it works, it is a cool performance optimization. But from my side nothing is blocked if it doesn't. Eric
Re: [RFC] network namespaces
Hi all, This complete separation of namespaces is very useful for at least two purposes: - allowing users to create and manage their own various tunnels and VPNs, and - enabling easier and more straightforward live migration of groups of processes with their environment. I conceptually prefer this approach, but I seem to recall there were actual problems in using this for checkpoint/restart of lightweight (application) containers. Performance aside, are there any reasons why this approach would be problematic for c/r? I agree with this approach too; separated namespaces are the best way to identify the network resources for a specific container. I'm afraid Daniel may be on vacation, and don't know who else other than Eric might have thoughts on this. Yes, I was on vacation, but I am back :) 2. People expressed concerns that complete separation of namespaces may introduce an undesired overhead in certain usage scenarios. The overhead comes from packets traversing the input path, then the output path, then the input path again in the destination namespace if the root namespace acts as a router. Yes, performance is probably one issue. My concern was layer 2 vs. layer 3 virtualization. I agree that layer 2 isolation/virtualization is best for the system container. But there is another family of containers, called application containers: it is not a system which is run inside a container but only the application. If you want to run an Oracle database inside a container, you can run it inside an application container without launching init and all the services. This family of containers is used too for HPC (high performance computing) and for distributed checkpoint/restart. The cluster runs hundreds of jobs, spawning them on different hosts inside an application container. Usually the jobs communicate via broadcast and multicast. Application containers do not care about having different MAC addresses and rely on a layer 3 approach. Are application containers comfortable with layer 2 virtualization? I don't think so, because several jobs running inside the same host communicate via broadcast/multicast between themselves and with other jobs running on different hosts. IP consumption is a problem too: 1 container == 2 IPs (one for the root namespace, one for the container), multiplied by the number of jobs. Furthermore, lots of jobs == lots of virtual devices. However, after a discussion with Kirill at OLS, it appears we can merge the layer 2 and 3 approaches if the level of network virtualization is tunable and we can choose layer 2 or layer 3 when doing the unshare. The determination of the namespace for the incoming traffic can be done with a specific iptables module as a first step; a sketch of that idea follows below. While looking at the network namespace patches, it appears that the TCP/UDP part is **very** similar to what is needed for a layer 3 approach. Any thoughts? Daniel
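[As a rough illustration of the "specific iptables module as a first step" that Daniel mentions, a PREROUTING hook could pick the destination namespace by destination address. This is only a sketch of the idea, using the 2.6.18-era hook signature: struct net_ns, the net_ns field on the skb and net_ns_find_by_addr() are all hypothetical names, not part of any posted patch.]

#include <linux/ip.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/skbuff.h>

struct net_ns;						/* hypothetical namespace type */
extern struct net_ns *net_ns_find_by_addr(__be32 daddr);/* hypothetical lookup */

/*
 * Tag incoming IPv4 packets with a destination namespace before
 * routing, so the upper layers can do their lookups in that context.
 */
static unsigned int netns_classify(unsigned int hooknum,
				   struct sk_buff **pskb,
				   const struct net_device *in,
				   const struct net_device *out,
				   int (*okfn)(struct sk_buff *))
{
	struct iphdr *iph = (*pskb)->nh.iph;
	struct net_ns *ns = net_ns_find_by_addr(iph->daddr);

	if (ns)
		(*pskb)->net_ns = ns;	/* hypothetical per-skb namespace tag */
	return NF_ACCEPT;
}

static struct nf_hook_ops netns_classify_ops = {
	.hook     = netns_classify,
	.pf       = PF_INET,
	.hooknum  = NF_IP_PRE_ROUTING,
	.priority = NF_IP_PRI_FIRST,
};

static int __init netns_classify_init(void)
{
	return nf_register_hook(&netns_classify_ops);
}

[Classifying at PREROUTING like this is one concrete form of the short-cuts Andrey proposes in the original RFC further down the thread: the packet is routed once in the root namespace and then handed to the destination namespace.]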
Re: [RFC] network namespaces
Daniel Lezcano [EMAIL PROTECTED] writes: snip Are application containers comfortable with layer 2 virtualization? I don't think so, because several jobs running inside the same host communicate via broadcast/multicast between themselves and with other jobs running on different hosts. IP consumption is a problem too: 1 container == 2 IPs (one for the root namespace, one for the container), multiplied by the number of jobs. Furthermore, lots of jobs == lots of virtual devices. snip Any thoughts? For HPC, if you are interested in migration, you need a separate IP per container. If you can take your IP address with you, migration of networking state is simple. If you can't take your IP address with you, a network container is nearly pointless from a migration perspective. Beyond that, from everything I have seen, layer 2 is just much cleaner than any layer 3 approach short of Serge's bind filtering. Beyond that, I have yet to see clean semantics for anything resembling your layer 2/layer 3 hybrid approach. If we can't have clear semantics, it is by definition impossible to implement correctly, because no one understands what it is supposed to do. Note: a true layer 3 approach has no impact on TCP/UDP filtering because it filters at bind time, not at packet reception time. Once you start inspecting packets, I don't see what the gain is from not going all of the way to layer 2. Eric
Re: [RFC] network namespaces
For HPC, if you are interested in migration, you need a separate IP per container. If you can take your IP address with you, migration of networking state is simple. If you can't take your IP address with you, a network container is nearly pointless from a migration perspective. Eric, please, I know... I showed you a migration demo at OLS ;) Beyond that, from everything I have seen, layer 2 is just much cleaner than any layer 3 approach short of Serge's bind filtering. Beyond that, I have yet to see clean semantics for anything resembling your layer 2/layer 3 hybrid approach. If we can't have clear semantics, it is by definition impossible to implement correctly, because no one understands what it is supposed to do. Note: a true layer 3 approach has no impact on TCP/UDP filtering because it filters at bind time, not at packet reception time. Once you start inspecting packets, I don't see what the gain is from not going all of the way to layer 2. The bsdjail was just for information ... - Daniel
Re: [RFC] network namespaces
Yes, performance is probably one issue. My concern was layer 2 vs. layer 3 virtualization. snip However, after a discussion with Kirill at OLS, it appears we can merge the layer 2 and 3 approaches if the level of network virtualization is tunable and we can choose layer 2 or layer 3 when doing the unshare. The determination of the namespace for the incoming traffic can be done with a specific iptables module as a first step. While looking at the network namespace patches, it appears that the TCP/UDP part is **very** similar to what is needed for a layer 3 approach. Any thoughts? My humble opinion is that your approach doesn't intersect with this one, so we can freely go with both *if needed*, and hear comments from the network guru guys on what to improve and how. So I suggest you at least send the patches, so we can discuss them. Thanks, Kirill
Re: [RFC] network namespaces
On Tue, Sep 05, 2006 at 08:45:39AM -0600, Eric W. Biederman wrote: Daniel Lezcano [EMAIL PROTECTED] writes: snip For HPC, if you are interested in migration, you need a separate IP per container. If you can take your IP address with you, migration of networking state is simple. If you can't take your IP address with you, a network container is nearly pointless from a migration perspective. Beyond that, from everything I have seen, layer 2 is just much cleaner than any layer 3 approach short of Serge's bind filtering. well, the 'ip subset' approach Linux-VServer and other jail solutions use is very clean, it just does not match your expectations of a virtual interface (as there is none), and it does not cope well with all kinds of per-context 'requirements', which IMHO do not really exist on the application layer (only on the whole-system layer) Beyond that, I have yet to see clean semantics for anything resembling your layer 2/layer 3 hybrid approach. If we can't have clear semantics, it is by definition impossible to implement correctly, because no one understands what it is supposed to do. IMHO that would be quite simple: have a 'namespace' for limiting port binds to a subset of the available ips, and another one which does complete network virtualization with all the whistles and bells. IMHO most of them are orthogonal and can easily be combined: - full network virtualization - lightweight ip subset - both Note: a true layer 3 approach has no impact on TCP/UDP filtering because it filters at bind time, not at packet reception time. Once you start inspecting packets, I don't see what the gain is from not going all of the way to layer 2. IMHO this requirement only arises from the full system virtualization approach; just look at the other jail solutions (solaris, bsd, ...) -- some of them do not even allow for more than a single ip, but they work quite well when used properly ... best, Herbert Eric
Re: [RFC] network namespaces
This family of containers is used too for HPC (high performance computing) and for distributed checkpoint/restart. The cluster runs hundreds of jobs, spawning them on different hosts inside an application container. Usually the jobs communicate via broadcast and multicast. Application containers do not care about having different MAC addresses and rely on a layer 3 approach. Ok, I think to understand this we need some precise definitions. In the normal case it is an error for a job to communicate with a different job. The basic advantage of a different MAC is that you can find out who the intended recipient is sooner in the networking stack, and you have truly separate network devices, allowing for a cleaner implementation. Changing the MAC after migration is likely to be fine. Are application containers comfortable with layer 2 virtualization? I don't think so, because several jobs running inside the same host communicate via broadcast/multicast between themselves and with other jobs running on different hosts. IP consumption is a problem too: 1 container == 2 IPs (one for the root namespace, one for the container), multiplied by the number of jobs. Furthermore, lots of jobs == lots of virtual devices. First, if you hook your network namespaces together with ethernet bridging, you don't need any extra IPs. Second, I don't see the conflict you perceive between application containers and layer 2 containment. The bottom line is that you need at least one loopback interface per non-trivial network namespace. Once you get that, having a virtual device is no big deal. In addition, network devices don't consume more memory than a process, so lots of network devices should not be a problem. Eric
Re: [RFC] network namespaces
Herbert Poetzl [EMAIL PROTECTED] writes: On Tue, Sep 05, 2006 at 08:45:39AM -0600, Eric W. Biederman wrote: snip well, the 'ip subset' approach Linux-VServer and other jail solutions use is very clean, it just does not match your expectations of a virtual interface (as there is none), and it does not cope well with all kinds of per-context 'requirements', which IMHO do not really exist on the application layer (only on the whole-system layer) I probably expressed that wrong. There are currently three basic approaches under discussion. Layer 3 (basically bind filtering), with nothing at the packet level: the approach taken by Serge's version of bsdjails and by Vserver. Layer 2.5: what Daniel proposed. Layer 2 (trivially mapping each packet to a different interface, and then treating everything as multiple instances of the network stack): roughly what OpenVZ and I have implemented. You can get into some weird complications at layer 3, but because it doesn't touch each packet, the proof that it is fast is trivial. IMHO that would be quite simple: have a 'namespace' for limiting port binds to a subset of the available ips, and another one which does complete network virtualization with all the whistles and bells. IMHO most of them are orthogonal and can easily be combined: - full network virtualization - lightweight ip subset - both Quite possibly. The LSM will stay for a while, so we do have a clean way to restrict port binds. Note: a true layer 3 approach has no impact on TCP/UDP filtering because it filters at bind time, not at packet reception time. Once you start inspecting packets, I don't see what the gain is from not going all of the way to layer 2. IMHO this requirement only arises from the full system virtualization approach; just look at the other jail solutions (solaris, bsd, ...) -- some of them do not even allow for more than a single ip, but they work quite well when used properly ... Yes they do. Currently I am strongly opposed to Daniel's Layer 2.5 approach, as I see no redeeming value in it. A good clean layer 3 approach I avoid only because I think we can do better. Eric
Re: [RFC] network namespaces
Basically there are currently three approaches that have been proposed. The trivial bsdjail style as implemented by Serge, and in a slightly more sophisticated version in vserver. As it does not touch the packets, this approach has little to no packet-level overhead. Basically this is what I have called the Level 3 approach. The more in-depth approach, where we modify the packet processing based upon which network interface the packet comes in on, so that each namespace has its own instance of the network stack. Roughly what was proposed earlier in this thread as the Level 2 approach. This potentially has per-packet overhead, so we need to watch the implementation very carefully. Some weird hybrid as proposed by Daniel, whose semantics I was never clear on. The good thing is that these approaches do not contradict each other. We discussed it with Daniel during the summit, and, as Andrey proposed, some shortcuts can be created to avoid double stack traversal. From the previous conversations my impression was that as long as we could get a Layer 2 approach that did not slow down the networking stack and was clean, everyone would be happy. agree. I'm buried in the process id namespace at the moment, and expect to be so for the rest of the month, so I'm not going to be very helpful except for a few stray comments. I will be very much obliged if you find some time to review these new patches so that we can make some progress here. Thanks, Kirill
[RFC] network namespaces
Hi All, I'd like to resurrect our discussion about network namespaces. In our previous discussions it appeared that we have rather polar concepts which seemed hard to reconcile. Now I have an idea of how to look at all the discussed concepts in a way that enables everyone's usage scenario. 1. The most straightforward concept is complete separation of namespaces, covering the device list, routing tables, netfilter tables, socket hashes, and everything else. On the input path, each packet is tagged with a namespace right from the place where it appears from a device, and is processed by each layer in the context of this namespace. Non-root namespaces communicate with the outside world in two ways: by owning hardware devices, or by receiving packets forwarded to them by their parent namespace via a pass-through device. This complete separation of namespaces is very useful for at least two purposes: - allowing users to create and manage their own various tunnels and VPNs, and - enabling easier and more straightforward live migration of groups of processes with their environment. 2. People expressed concerns that complete separation of namespaces may introduce an undesired overhead in certain usage scenarios. The overhead comes from packets traversing the input path, then the output path, then the input path again in the destination namespace if the root namespace acts as a router. So, we may introduce short-cuts, where an input packet starts to be processed in one namespace but changes it at some upper layer. The places where a packet can change namespace are, for example: routing, the post-routing netfilter hook, or even the lookup in the socket hash. The cleanest example among them is the post-routing netfilter hook. Tagging input packets there means that a packet is checked against the root namespace's routing table, found to be local, and goes directly to the socket hash lookup in the destination namespace. In this scheme the ability to change routing tables or netfilter rules on a per-namespace basis is traded for lower overhead. All other optimized schemes, where input packets do not travel input-output-input paths in the general case, may be viewed as short-cuts in scheme (1). The remaining question is exactly which short-cuts make the most sense, and how to make them consistent from the interface point of view. My current idea is to reach some agreement on the basic concept, review patches, and then move on to implementing feasible short-cuts. Opinions? Next in this thread are patches introducing namespaces to the device list, IPv4 routing, and socket hashes, and a pass-through device. Patches are against 2.6.18-rc4-mm1. Best regards, Andrey
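[To visualize concept (1), here is the shape such tagging could take, reduced to its bare minimum. The names are illustrative only -- the sketch assumes the patches add a net_ns pointer to both struct net_device and struct sk_buff; consult the actual patches in this thread for the real structures.]

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Everything that is global in the stack today becomes a member. */
struct net_ns {
	struct list_head dev_list;	/* per-namespace device list */
	/* per-namespace routing tables, netfilter tables,
	 * socket hashes, sysctls, ... */
};

/*
 * Right where the packet appears from a device on the input path:
 * inherit the namespace of the receiving device.  Every layer above
 * then resolves routes, rules and socket hashes through this tag
 * instead of through the globals.
 */
static inline void net_ns_tag_skb(struct sk_buff *skb)
{
	skb->net_ns = skb->dev->net_ns;	/* assumed new fields */
}

[Presumably the pass-through device is then just a device pair whose transmit side re-tags the skb with the peer namespace and requeues it on the input path -- which is exactly where the input-output-input traversal described in point (2) comes from.]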
Re: [RFC] network namespaces
Quoting Andrey Savochkin ([EMAIL PROTECTED]): snip This complete separation of namespaces is very useful for at least two purposes: - allowing users to create and manage their own various tunnels and VPNs, and - enabling easier and more straightforward live migration of groups of processes with their environment. I conceptually prefer this approach, but I seem to recall there were actual problems in using this for checkpoint/restart of lightweight (application) containers. Performance aside, are there any reasons why this approach would be problematic for c/r? I'm afraid Daniel may be on vacation, and I don't know who else other than Eric might have thoughts on this. snip Just to provide the extreme other end of implementation options, here is the bsdjail-based version I've been using for some testing while waiting for network namespaces to show up in -mm :) (Not intended for *any* sort of inclusion consideration :) Example usage:

ifconfig eth0:0 192.168.1.16
echo -n ip 192.168.1.16 > /proc/$$/attr/exec
exec /bin/sh

-serge From: Serge E. Hallyn [EMAIL PROTECTED](none) Date: Wed, 26 Jul 2006 21:47:13 -0500 Subject: [PATCH 1/1] bsdjail: define bsdjail lsm Define the actual bsdjail LSM. Signed-off-by: Serge E.
Hallyn [EMAIL PROTECTED]
---
 security/Kconfig   |   11
 security/Makefile  |    1
 security/bsdjail.c | 1351
 3 files changed, 1363 insertions(+), 0 deletions(-)

diff --git a/security/Kconfig b/security/Kconfig
index 67785df..fa30e40 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -105,6 +105,17 @@ config SECURITY_SECLVL
 	  If you are unsure how to answer this question, answer N.
 
+config SECURITY_BSDJAIL
+	tristate "BSD Jail LSM"
+	depends on SECURITY
+	select SECURITY_NETWORK
+	help
+	  Provides BSD Jail compartmentalization functionality.
+	  See Documentation/bsdjail.txt for more information and
+	  usage instructions.
+
+	  If you are unsure how to answer this question, answer N.
+
 source security/selinux/Kconfig
 
 endmenu
diff --git a/security/Makefile b/security/Makefile
index 8cbbf2f..050b588 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -17,3 +17,4 @@ obj-$(CONFIG_SECURITY_SELINUX)	+= selin
 obj-$(CONFIG_SECURITY_CAPABILITIES)	+= commoncap.o capability.o
 obj-$(CONFIG_SECURITY_ROOTPLUG)		+= commoncap.o root_plug.o
 obj-$(CONFIG_SECURITY_SECLVL)		+=
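[For readers skimming the thread: the heart of such a jail -- and of the layer 3 bind filtering discussed throughout -- is a bind-time check. The following is a minimal sketch of that idea against the 2.6.18-era LSM hooks, not an excerpt from Serge's 1351-line module; the single-IP struct jail and jail_of() are invented simplifications.]

#include <linux/in.h>
#include <linux/sched.h>
#include <linux/security.h>

/* Simplified jail label: one IPv4 address the jail may bind to. */
struct jail {
	__be32 ip;
};

/* The label is assumed to be stored in task->security on jail entry. */
static inline struct jail *jail_of(struct task_struct *tsk)
{
	return tsk->security;
}

/*
 * Deny binds to any local address other than the jail's own IP.
 * Nothing on the packet path is touched, which is why this approach
 * has essentially no per-packet overhead.
 */
static int jail_socket_bind(struct socket *sock,
			    struct sockaddr *address, int addrlen)
{
	struct jail *jail = jail_of(current);
	struct sockaddr_in *sin = (struct sockaddr_in *)address;

	if (!jail || address->sa_family != AF_INET)
		return 0;
	if (addrlen < sizeof(struct sockaddr_in))
		return -EINVAL;
	if (sin->sin_addr.s_addr != jail->ip)
		return -EPERM;
	return 0;
}

static struct security_operations jail_security_ops = {
	.socket_bind = jail_socket_bind,
};

[A fuller version would also remap INADDR_ANY binds to the jail's address and register the ops with register_security(); this sketch shows only the filtering decision itself.]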
Re: [RFC] network namespaces
Hello! (application) containers. Performance aside, are there any reasons why this approach would be problematic for c/r? This approach is just perfect for c/r. Probably, this is the only approach with which migration can be done in a clean and self-consistent way. Alexey
Re: [RFC] network namespaces
Alexey Kuznetsov [EMAIL PROTECTED] writes: Hello! (application) containers. Performance aside, are there any reasons why this approach would be problematic for c/r? This approach is just perfect for c/r. Yes. For c/r you need to take your state with you. Probably, this is the only approach with which migration can be done in a clean and self-consistent way. Basically there are currently three approaches that have been proposed. The trivial bsdjail style as implemented by Serge, and in a slightly more sophisticated version in vserver. As it does not touch the packets, this approach has little to no packet-level overhead. Basically this is what I have called the Level 3 approach. The more in-depth approach, where we modify the packet processing based upon which network interface the packet comes in on, so that each namespace has its own instance of the network stack. Roughly what was proposed earlier in this thread as the Level 2 approach. This potentially has per-packet overhead, so we need to watch the implementation very carefully. Some weird hybrid as proposed by Daniel, whose semantics I was never clear on. From the previous conversations my impression was that as long as we could get a Layer 2 approach that did not slow down the networking stack and was clean, everyone would be happy. I'm buried in the process id namespace at the moment, and expect to be so for the rest of the month, so I'm not going to be very helpful except for a few stray comments. Eric
Re: [RFC] Network namespaces a path to mergable code.
Hello, Eric W. Biederman wrote: Thinking about this, I am going to suggest a slightly different direction for getting a patchset we can merge. First we concentrate on the fundamentals. - How we mark a device as belonging to a specific network namespace. - How we mark a socket as belonging to a specific network namespace. snip How does that proposal differ from Daniel's initial patchset? How far was that patchset from reaching a similar agreement? OK, I wear blue socks :), but I'm not advocating one patchset over another; I'm just looking for a shorter path. thanks, C.
Re: [RFC] Network namespaces a path to mergable code.
Cedric Le Goater [EMAIL PROTECTED] writes: How does that proposal differ from Daniel's initial patchset? How far was that patchset from reaching a similar agreement? My impression is as follows. The OpenVZ implementation and mine work on the same basic principles of handling the network stack at layer 2. We have our implementation differences, but the core ideas are about the same. Daniel's patch still had elements of layer 3 handling as I recall, and that has problems. OK, I wear blue socks :), but I'm not advocating one patchset over another; I'm just looking for a shorter path. Besides laying the foundations, the current conversation seems to be about understanding the implications for the network stack when we implement a network namespace. There is a lot to the networking stack, so it takes a while. In addition, this is the one part of the problem that everyone has implemented, so we have several more opinions on how it should be done and what needs to happen. Eric
[RFC] Network namespaces a path to mergable code.
Thinking about this, I am going to suggest a slightly different direction for getting a patchset we can merge. First we concentrate on the fundamentals. - How we mark a device as belonging to a specific network namespace. - How we mark a socket as belonging to a specific network namespace. As part of the fundamentals, we add a patch to the generic socket code that, by default, will disable socket creation for protocol families that do not indicate support for handling network namespaces, when in a non-default network namespace. I think that gives us a path that will allow us to convert the network stack one protocol family at a time, instead of in one big lump. Stubbing off the sysfs and sysctl interfaces in the first round for the non-default namespaces, as you have done, should be good enough. The reason for the suggestion is that most of the work for the protocol stacks (ipv4, ipv6, af_packet, af_unix) is largely noise: simple replacement without real design work happening. Mostly it is just tweaking the code to remove global variables and doing a couple of lookups. Eric
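[A sketch of what that generic-socket-code gate might look like. The flags field on net_proto_family, the NET_NS_OK bit, current_net_ns() and init_net_ns do not exist in mainline; they are assumptions standing in for whatever mechanism the real patches choose.]

#include <linux/errno.h>
#include <linux/net.h>

struct net_ns;					/* hypothetical namespace type */
extern struct net_ns init_net_ns;		/* hypothetical default namespace */
extern struct net_ns *current_net_ns(void);	/* hypothetical accessor */

#define NET_NS_OK	0x1	/* family declares namespace support */

/*
 * Would be called from __sock_create() after net_families[family]
 * has been looked up: unconverted protocol families simply do not
 * exist inside a non-default network namespace.
 */
static int net_ns_check_family(const struct net_proto_family *pf)
{
	if (current_net_ns() == &init_net_ns)	/* default namespace:  */
		return 0;			/* full functionality  */
	if (!(pf->flags & NET_NS_OK))		/* assumed new field   */
		return -EAFNOSUPPORT;
	return 0;
}

[Converting a family then amounts to removing its global variables, keying its hashes and tables by namespace, and finally setting the flag -- which is why the conversion can proceed one protocol family at a time.]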