Re: [RFC] network namespaces

2006-10-04 Thread Daniel Lezcano

Andrey Savochkin wrote:

Hi All,

I'd like to resurrect our discussion about network namespaces.
In our previous discussions it appeared that we have rather polar concepts
which seemed hard to reconcile.
Now I have an idea how to look at all discussed concepts to enable everyone's
usage scenario.


Hi Andrey,

I have a few questions ... sorry for asking so late ;)



1. The most straightforward concept is complete separation of namespaces,
   covering device list, routing tables, netfilter tables, socket hashes, and
   everything else.

   On input path, each packet is tagged with namespace right from the
   place where it appears from a device, and is processed by each layer
   in the context of this namespace.
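
For illustration, a minimal toy sketch (plain, self-contained userspace C, not the actual patch) of the idea that a packet picks up its namespace from the receiving device and carries that tag through the stack; all structure and function names here are made up:

#include <stdio.h>

/* toy stand-ins for the kernel structures being discussed */
struct net_namespace {
        const char *name;             /* would hold routing tables, socket hashes, ... */
};

struct net_device {
        const char *ifname;
        struct net_namespace *ns;     /* the namespace owning this device */
};

struct sk_buff {
        struct net_namespace *ns;     /* the per-packet namespace tag */
        int len;
};

/* input path: tag the packet with the namespace of the device it arrived on */
static void netif_receive(struct net_device *dev, struct sk_buff *skb)
{
        skb->ns = dev->ns;
        /* IP input, routing lookup and socket demux would then all consult
         * skb->ns instead of the global tables */
        printf("%d byte packet from %s handled in namespace %s\n",
               skb->len, dev->ifname, skb->ns->name);
}

int main(void)
{
        struct net_namespace guest = { "guest0" };
        struct net_device eth1 = { "eth1", &guest };
        struct sk_buff skb = { 0, 1500 };

        netif_receive(&eth1, &skb);
        return 0;
}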


If you know which namespace the packet is coming from, why do you tag the 
packet instead of switching to the right namespace?



   Non-root namespaces communicate with the outside world in two ways: by
   owning hardware devices, or by receiving packets forwarded to them by their
   parent namespace via a pass-through device.


Will you do proxy ARP and IP forwarding in the root namespace in 
order to make non-root namespaces visible from the outside world?


Regards.

-- Daniel


Re: [RFC] network namespaces

2006-09-12 Thread Dmitry Mishin
Sorry, I didn't understand your proposal correctly from the previous talk. :)
But...

On Tuesday 12 September 2006 07:28, Eric W. Biederman wrote:
 Do you have some concrete arguments against the proposal?
Yes, I have. I think it is an unnecessary complication, and this complication will 
lead to additional bugs, especially if we accept rule creation from 
userspace. Why do we need a complex solution if there are only two approaches to 
socket binding - isolation and virtualization? These approaches could co-exist 
without hooks. Or do you have thoughts about other ways?

-- 
Thanks,
Dmitry.


Re: [Devel] Re: [RFC] network namespaces

2006-09-11 Thread Daniel Lezcano

Dmitry Mishin wrote:

On Friday 08 September 2006 22:11, Herbert Poetzl wrote:


actually the light-weight ip isolation runs perfectly
fine _without_ CAP_NET_ADMIN, as you do not want the
guest to be able to mess with the 'configured' ips at
all (not to speak of interfaces here)


It was only an example. I'm thinking about how to implement a flexible solution 
which permits light-weight IP isolation as well as full-fledged network 
virtualization. Another solution is to split CONFIG_NET_NAMESPACE. Is it good 
for you?


Hi Dmitry,

I am currently working on this and I am finishing a prototype bringing 
isolation at the IP layer. The prototype code is very close to Andrey's 
patches at the TCP/UDP level. So the next step is to merge the prototype 
code with the existing network namespace layer 2 isolation.


IMHO, the solution of splitting CONFIG_NET_NS into CONFIG_L2_NET_NS and 
CONFIG_L3_NET_NS is not acceptable to me because you would need to 
recompile the kernel. The proper way is certainly to have a specific 
flag for unshare, something like CLONE_NEW_L2_NET and 
CLONE_NEW_L3_NET for example (roughly sketched below).
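
A rough sketch of how such an unshare interface could look from userspace. 
CLONE_NEW_L2_NET and CLONE_NEW_L3_NET do not exist; the flag values below are 
invented purely for illustration:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* hypothetical flags proposed in this thread -- the values are made up */
#define CLONE_NEW_L3_NET 0x04000000   /* layer 3 (IP) isolation only  */
#define CLONE_NEW_L2_NET 0x08000000   /* full layer 2 virtualization  */

int main(void)
{
        /* a light-weight guest would only ask for the layer 3 view ... */
        if (unshare(CLONE_NEW_L3_NET) == -1)
                perror("unshare(CLONE_NEW_L3_NET)");

        /* ... while a system container would unshare layer 2 as well */
        if (unshare(CLONE_NEW_L2_NET) == -1)
                perror("unshare(CLONE_NEW_L2_NET)");

        return 0;
}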


  -- Daniel



Re: [Devel] Re: [RFC] network namespaces

2006-09-11 Thread Herbert Poetzl
On Mon, Sep 11, 2006 at 04:40:59PM +0200, Daniel Lezcano wrote:
 Dmitry Mishin wrote:
 On Friday 08 September 2006 22:11, Herbert Poetzl wrote:
 
 actually the light-weight ip isolation runs perfectly
 fine _without_ CAP_NET_ADMIN, as you do not want the
 guest to be able to mess with the 'configured' ips at
 all (not to speak of interfaces here)
 
 It was only an example. I'm thinking about how to implement a flexible solution, 
 which permits light-weight IP isolation as well as full-fledged network 
 virtualization. Another solution is to split CONFIG_NET_NAMESPACE. 
 Is it good for you?
 
 Hi Dmitry,
 
 I am currently working on this and I am finishing a prototype bringing
 isolation at the IP layer. The prototype code is very close to
 Andrey's patches at the TCP/UDP level. So the next step is to merge the
 prototype code with the existing network namespace layer 2 isolation.

you might want to take a look at the current Linux-VServer
implementation for the network isolation too, should be
quite similar to Andrey's approach, but maybe you can
gather some additional information from there

 IMHO, the solution of splitting CONFIG_NET_NS into CONFIG_L2_NET_NS
 and CONFIG_L3_NET_NS is not acceptable to me because you would need
 to recompile the kernel. The proper way is certainly to have a
 specific flag for unshare, something like CLONE_NEW_L2_NET and
 CLONE_NEW_L3_NET for example.

I completely agree here, we need a separate namespace
for that, so that we can combine isolation and virtualization
as needed, unless the bind restrictions can be completely
expressed with an additional mangle or filter table (as
was suggested)

best,
Herbert

   -- Daniel


Re: [Devel] Re: [RFC] network namespaces

2006-09-11 Thread Daniel Lezcano

Herbert Poetzl wrote:

On Mon, Sep 11, 2006 at 04:40:59PM +0200, Daniel Lezcano wrote:




I am currently working on this and I am finishing a prototype bringing
isolation at the IP layer. The prototype code is very close to
Andrey's patches at the TCP/UDP level. So the next step is to merge the
prototype code with the existing network namespace layer 2 isolation.



you might want to take a look at the current Linux-VServer
implementation for the network isolation too, should be
quite similar to Andrey's approach, but maybe you can
gather some additional information from there


ok, thanks. I will.


IMHO, the solution of splitting CONFIG_NET_NS into CONFIG_L2_NET_NS
and CONFIG_L3_NET_NS is not acceptable to me because you would need
to recompile the kernel. The proper way is certainly to have a
specific flag for unshare, something like CLONE_NEW_L2_NET and
CLONE_NEW_L3_NET for example.



I completely agree here, we need a separate namespace
for that, so that we can combine isolation and virtualization
as needed, unless the bind restrictions can be completely
expressed with an additional mangle or filter table (as
was suggested)


What is the bind restriction? Do you want to force binding to a 
specific source address?


  -- Daniel


Re: [Devel] Re: [RFC] network namespaces

2006-09-11 Thread Dmitry Mishin
On Monday 11 September 2006 18:57, Herbert Poetzl wrote:
 I completely agree here, we need a separate namespace
 for that, so that we can combine isolation and virtualization
 as needed, unless the bind restrictions can be completely
 expressed with an additional mangle or filter table (as
 was suggested)
iptables is designed for packet flow decisions and filtering; it has nothing 
in common with bind restrictions. So it may only do packet flow 
scheduling/filtering, but it will not help to resolve bind-time IP conflicts.

-- 
Thanks,
Dmitry.


Re: [RFC] network namespaces

2006-09-11 Thread Eric W. Biederman
Dmitry Mishin [EMAIL PROTECTED] writes:

 On Sunday 10 September 2006 06:47, Herbert Poetzl wrote:
 well, I think it would be best to have both, as
 they are complementary to some degree, and IMHO
 both, the full virtualization _and_ the isolation
 will require a separate namespace to work,   
 [snip]
 I do not think that folks would want to recompile
 their kernel just to get a light-weight guest or
 a fully virtualized one
 In this case a light-weight guest will have unnecessary overhead.
 For example, instead of using a static pointer, we have to look up the required 
 common namespace first. And there will be no advantage for such a guest over 
 a full-featured one.

Dmitry, that just isn't true if implemented properly.

Eric


Re: [RFC] network namespaces

2006-09-11 Thread Eric W. Biederman
Dmitry Mishin [EMAIL PROTECTED] writes:

 On Monday 11 September 2006 18:57, Herbert Poetzl wrote:
 I completely agree here, we need a separate namespace
 for that, so that we can combine isolation and virtualization
 as needed, unless the bind restrictions can be completely
 expressed with an additional mangle or filter table (as
 was suggested)

 iptables is designed for packet flow decisions and filtering; it has nothing 
 in common with bind restrictions. So it may only do packet flow 
 scheduling/filtering, but it will not help to resolve bind-time IP conflicts.

Please read the archive, where the suggestion was made.

What was suggested was a new table, with its own set of chains,
so that we could make filtering decisions on where sockets could be bound.

That is not a far stretch from where iptables is today.
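
To make the idea concrete, here is a toy model of such a bind-filtering table in
plain userspace C; the table layout and the notion of keying rules by a group of
processes are assumptions for illustration, no such iptables table exists today:

#include <stdio.h>
#include <string.h>

/* one rule of a hypothetical "bind" table: which local address a given
 * group of processes is allowed to bind to */
struct bind_rule {
        unsigned int group;          /* the "appropriate group of processes" */
        const char  *allowed_ip;
};

static const struct bind_rule bind_table[] = {
        { 100, "10.0.0.1" },
        { 101, "10.0.0.2" },
};

/* would be consulted from the bind() path, much like a netfilter chain */
static int bind_allowed(unsigned int group, const char *requested_ip)
{
        size_t i;

        for (i = 0; i < sizeof(bind_table) / sizeof(bind_table[0]); i++)
                if (bind_table[i].group == group &&
                    strcmp(bind_table[i].allowed_ip, requested_ip) == 0)
                        return 1;
        return 0;       /* default policy: refuse (-EPERM in kernel terms) */
}

int main(void)
{
        printf("group 100 may bind 10.0.0.1: %d\n", bind_allowed(100, "10.0.0.1"));
        printf("group 100 may bind 10.0.0.2: %d\n", bind_allowed(100, "10.0.0.2"));
        return 0;
}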

Do you have some concrete arguments against the proposal?

Eric


Re: [Devel] Re: [RFC] network namespaces

2006-09-10 Thread Dmitry Mishin
On Sunday 10 September 2006 06:47, Herbert Poetzl wrote:
 well, I think it would be best to have both, as
 they are complementary to some degree, and IMHO
 both, the full virtualization _and_ the isolation
 will require a separate namespace to work,   
[snip]
 I do not think that folks would want to recompile
 their kernel just to get a light-weight guest or
 a fully virtualized one
In this case a light-weight guest will have unnecessary overhead.
For example, instead of using a static pointer, we have to look up the required 
common namespace first. And there will be no advantage for such a guest over a 
full-featured one.


 best,
 Herbert

  --
  Thanks,
  Dmitry.

-- 
Thanks,
Dmitry.


Re: [Devel] Re: [RFC] network namespaces

2006-09-10 Thread Dmitry Mishin
On Sunday 10 September 2006 07:41, Eric W. Biederman wrote:
 I certainly agree that we are not at a point where a final decision
 can be made.  A major piece of that is that a layer 2 approach has
 not shown to be without a performance penalty.
But it is required. Why limit the possible usages?
 
 A practical question.  Do the IPs assigned to guests ever get used
 by anything besides the guest?
In case of level2 virtualization - no.

-- 
Thanks,
Dmitry.


Re: [RFC] network namespaces

2006-09-10 Thread Eric W. Biederman
Dmitry Mishin [EMAIL PROTECTED] writes:

 On Sunday 10 September 2006 07:41, Eric W. Biederman wrote:
 I certainly agree that we are not at a point where a final decision
 can be made.  A major piece of that is that a layer 2 approach has
 not shown to be without a performance penalty.
 But it is required. Why limit the possible usages?

Wrong perspective.

The point is that we need to dig in and show that there is no
measurable penalty for the current cases.  Showing that there
is little penalty for the advanced configurations is a plus.

The practical question is, do we need to implement the grand unified
lookup before we can do this cheaply, or can we implement this without
needing that optimization?

For perspective: to get a good implementation of the pid namespace
I am having to refactor significant parts of the kernel so that it uses
abstractions that can cleanly express what we are doing.  The
networking stack is in better shape, but there is a lot of it. 

 A practical question.  Do the IPs assigned to guests ever get used
 by anything besides the guest?
 In case of level2 virtualization - no.

Actually that is one of the benefits of a layer 2 implementation:
you can set up weird things like shared IPs, which various types
of failover scenarios want.

My question was really about the layer 3 bind filtering techniques,
and how people are using them.

The basic attraction of layer 3 is that you can do a simple
implementation, it will run very fast, and it doesn't need
to conflict with the layer 2 work at all.  If you can make that layer
3 implementation clean and generally mergeable as well, it is worth
pursuing.

Eric


Re: [Devel] Re: [RFC] network namespaces

2006-09-10 Thread Herbert Poetzl
On Sat, Sep 09, 2006 at 09:41:35PM -0600, Eric W. Biederman wrote:
 Herbert Poetzl [EMAIL PROTECTED] writes:
 
  On Sat, Sep 09, 2006 at 11:57:24AM +0400, Dmitry Mishin wrote:
  On Friday 08 September 2006 22:11, Herbert Poetzl wrote:
   actually the light-weight ip isolation runs perfectly
   fine _without_ CAP_NET_ADMIN, as you do not want the
   guest to be able to mess with the 'configured' ips at
   all (not to speak of interfaces here)
 
  It was only an example. I'm thinking about how to implement a flexible
  solution, which permits light-weight IP isolation as well as
  full-fledged network virtualization. Another solution is to split
  CONFIG_NET_NAMESPACE. Is it good for you?
 
  well, I think it would be best to have both, as
  they are complementary to some degree, and IMHO
  both, the full virtualization _and_ the isolation
  will require a separate namespace to work, I also
  think that limiting the isolation to something
  very simple (like one IP + network or so) would
  be acceptable for a start, because especially
  multi IP or network range checks require a little
  more effort to get them right ...
 
  I do not think that folks would want to recompile
  their kernel just to get a light-weight guest or
  a fully virtualized one
 
 I certainly agree that we are not at a point where a final decision
 can be made.  A major piece of that is that a layer 2 approach has
 not shown to be without a performance penalty.
 
 A practical question.  Do the IPs assigned to guests ever get used
 by anything besides the guest?

only in special setups and for testing routing and
general operation of course, i.e. one typical
failure scenario is this:

 - 'provider' has a bunch of ips assigned
 - 'host' ip works perfectly
 - 'guest' ip is not routed (by the external router)

in this case, for example, I always suggest to test
on the host with a guest ip, simplest example:

 ping -I guest-ip google.com

but for 'normal' operation, the guest ip is reserved
for the guests, unless some service like named is
shared between guests ...

HTH,
Herbert

 Eric


Re: [Devel] Re: [RFC] network namespaces

2006-09-10 Thread Herbert Poetzl
On Sun, Sep 10, 2006 at 11:45:35AM +0400, Dmitry Mishin wrote:
 On Sunday 10 September 2006 06:47, Herbert Poetzl wrote:
  well, I think it would be best to have both, as
  they are complementary to some degree, and IMHO
  both, the full virtualization _and_ the isolation
  will require a separate namespace to work,   
 [snip]
  I do not think that folks would want to recompile
  their kernel just to get a light-weight guest or
  a fully virtualized one

 In this case a light-weight guest will have unnecessary overhead. For
 example, instead of using a static pointer, we have to look up the required
 common namespace first. 

this is only required at 'bind' time, which is
a non-measurable fraction of the actual connection
usage (unless you keep binding ports over and over
without ever using them)

 And there will be no advantage for such a guest over a full-featured one.

the advantage is in the flexibility, simplicity of
setup and the basically non-existent overhead on
the hot (connection/transfer) path ...

  best,
  Herbert
 
   --
   Thanks,
   Dmitry.
 
 -- 
 Thanks,
 Dmitry.


Re: [Devel] Re: [RFC] network namespaces

2006-09-09 Thread Dmitry Mishin
On Friday 08 September 2006 22:11, Herbert Poetzl wrote:
 actually the light-weight ip isolation runs perfectly
 fine _without_ CAP_NET_ADMIN, as you do not want the
 guest to be able to mess with the 'configured' ips at
 all (not to speak of interfaces here)
It was only an example. I'm thinking about how to implement a flexible solution, 
which permits light-weight IP isolation as well as full-fledged network 
virtualization. Another solution is to split CONFIG_NET_NAMESPACE. Is it good 
for you?

-- 
Thanks,
Dmitry.


Re: [Devel] Re: [RFC] network namespaces

2006-09-09 Thread Herbert Poetzl
On Sat, Sep 09, 2006 at 11:57:24AM +0400, Dmitry Mishin wrote:
 On Friday 08 September 2006 22:11, Herbert Poetzl wrote:
  actually the light-weight ip isolation runs perfectly
  fine _without_ CAP_NET_ADMIN, as you do not want the
  guest to be able to mess with the 'configured' ips at
  all (not to speak of interfaces here)

 It was only an example. I'm thinking about how to implement a flexible
 solution, which permits light-weight IP isolation as well as
 full-fledged network virtualization. Another solution is to split
 CONFIG_NET_NAMESPACE. Is it good for you?

well, I think it would be best to have both, as
they are complementary to some degree, and IMHO
both, the full virtualization _and_ the isolation
will require a separate namespace to work, I also
think that limiting the isolation to something
very simple (like one IP + network or so) would
be acceptable for a start, because especially
multi IP or network range checks require a little
more effort to get them right ...

I do not think that folks would want to recompile
their kernel just to get a light-weight guest or
a fully virtualized one

best,
Herbert

 -- 
 Thanks,
 Dmitry.


Re: [Devel] Re: [RFC] network namespaces

2006-09-09 Thread Eric W. Biederman
Herbert Poetzl [EMAIL PROTECTED] writes:

 On Sat, Sep 09, 2006 at 11:57:24AM +0400, Dmitry Mishin wrote:
 On Friday 08 September 2006 22:11, Herbert Poetzl wrote:
  actually the light-weight ip isolation runs perfectly
  fine _without_ CAP_NET_ADMIN, as you do not want the
  guest to be able to mess with the 'configured' ips at
  all (not to speak of interfaces here)

 It was only an example. I'm thinking about how to implement a flexible
 solution, which permits light-weight IP isolation as well as
 full-fledged network virtualization. Another solution is to split
 CONFIG_NET_NAMESPACE. Is it good for you?

 well, I think it would be best to have both, as
 they are complementary to some degree, and IMHO
 both, the full virtualization _and_ the isolation
 will require a separate namespace to work, I also
 think that limiting the isolation to something
 very simple (like one IP + network or so) would
 be acceptable for a start, because especially
 multi IP or network range checks require a little
 more effort to get them right ...

 I do not think that folks would want to recompile
 their kernel just to get a light-weight guest or
 a fully virtualized one

I certainly agree that we are not at a point where a final decision
can be made.  A major piece of that is that a layer 2 approach has
not shown to be without a performance penalty.

A practical question.  Do the IPs assigned to guests ever get used
by anything besides the guest?

Eric


Re: [RFC] network namespaces

2006-09-08 Thread Herbert Poetzl
On Thu, Sep 07, 2006 at 12:29:21PM -0600, Eric W. Biederman wrote:
 Daniel Lezcano [EMAIL PROTECTED] writes:
 
  IMHO, I think there is one reason. The unsharing mechanism is
  not only for containers; it aims at other kinds of isolation, like a
  bsdjail for example. The unshare syscall is flexible; should the
  network unsharing be a one-block solution? For example, we want to
  launch an application using TCP/IP and we want to have
  an IP address only used by the application, nothing more.
  With a layer 2, we must after unsharing:
   1) create a virtual device into the application namespace
   2) assign an IP address
   3) create a virtual device pass-through in the root namespace
   4) set the virtual device IP
 
  All this stuff needs a lot of administration (checking MAC address
  conflicts, checking interface name collisions in the root namespace, ...)
  for a simple network isolation.
 
 Yes, and even more it is hard to show that it will perform as well.
 Although by dropping CAP_NET_ADMIN the actual runtime administration
 is about the same.
 
  With a layer 3:
   1) assign an IP address
 
  On the other hand, a layer 3 isolation is not sufficient to reach
  the level of isolation/virtualization needed for the system
  containers.
 
 Agreed.
 
  Very soon, I will commit more info at:
 
  http://wiki.openvz.org/Containers/Networking
 
  So the consensus is based on the fact that there is a lot of common
  code for the layer 2 and layer 3 isolation/virtualization, and we can
  find a way to merge the two implementations in order to have a flexible
  network virtualization/isolation.
 
 NACK. In a real level 3 implementation there is very little common
 code with a layer 2 implementation. You don't need to muck with the
 socket handling code, as you are not allowed to dup addresses between
 containers. Look at what Serge did; that is layer 3.

 A layer 3 isolation implementation should either be a new security
 module or a new form of iptables. The problem with using the LSM is
 that it seems to be an all-or-nothing mechanism, so it is a very
 coarse-grained tool for this job.

IMHO LSM was never an option for that, because it is
a) very complicated to use for that purpose,
b) missing many hooks you definitely need to make this work, and
c) not really efficient and/or performant

with something 'like' iptables, this could be done, but
I'm not sure that is the best approach either ...

best,
Herbert

 A layer 2 implementation (where you have network devices isolated and
 not sockets) should be a namespace.
 
 Eric


Re: [Devel] Re: [RFC] network namespaces

2006-09-08 Thread Dmitry Mishin
On Thursday 07 September 2006 21:27, Herbert Poetzl wrote:
 well, who said that you need to have things like RAW sockets
 or other protocols except IP, not to speak of iptable and
 routing entries ...

 folks who _want_ full network virtualization can use the
 more complete virtual setup and be happy ...
Let's think about how to implement this.
As I understood VServer's design, your proposal is to split CAP_NET_ADMIN into
multiple capabilities and use them as required. So, for your light-weight 
container it is enough to implement context isolation for the code protected by 
a CAP_NET_IP capability (for example) and put 'if (!capable(CAP_NET_*))' 
checks in all other places. But this could easily be implemented on top of the 
OpenVZ code by a CAP_VE_NET_ADMIN split.

So, the question is:
Could you point out the places in Andrey's implementation of network 
namespaces which prevent you from adding CAP_NET_ADMIN separation later?

-- 
Thanks,
Dmitry.


Re: [Devel] Re: [RFC] network namespaces

2006-09-08 Thread Herbert Poetzl
On Fri, Sep 08, 2006 at 05:10:08PM +0400, Dmitry Mishin wrote:
 On Thursday 07 September 2006 21:27, Herbert Poetzl wrote:
  well, who said that you need to have things like RAW sockets
  or other protocols except IP, not to speak of iptable and
  routing entries ...
 
  folks who _want_ full network virtualization can use the
  more complete virtual setup and be happy ...

 Let's think about how to implement this.

 As I understood VServer's design, your proposal is to split
 CAP_NET_ADMIN into multiple capabilities and use them as required. So,
 for your light-weight container it is enough to implement context
 isolation for the code protected by a CAP_NET_IP capability (for example)
 and put 'if (!capable(CAP_NET_*))' checks in all other places. 

actually the light-weight ip isolation runs perfectly
fine _without_ CAP_NET_ADMIN, as you do not want the
guest to be able to mess with the 'configured' ips at
all (not to speak of interfaces here)

best,
Herbert

 But this could be easily implemented over OpenVZ code by
 CAP_VE_NET_ADMIN split.
 
 So, the question is:
 Could you point out the places in Andrey's implementation of network
 namespaces which prevent you from adding CAP_NET_ADMIN separation later?
 
 -- 
 Thanks,
 Dmitry.


Re: [RFC] network namespaces

2006-09-07 Thread Daniel Lezcano

Caitlin Bestler wrote:

[EMAIL PROTECTED] wrote:
 


Finally, as I understand both network isolation and network
virtualization (both level2 and level3) can happily co-exist. We do
have several filesystems in kernel. Let's have several network
virtualization approaches, and let the user choose. Does that make
sense? 


If there are no compelling arguments for using both ways of
doing it, it is silly to merge both, as it is more maintenance overhead.




My reading is that full virtualization (Xen, etc.) calls for
implementing
L2 switching between the partitions and the physical NIC(s).

The tradeoffs between L2 and L3 switching are indeed complex, but
there are two implications of doing L2 switching between partitions:

1) Do we really want to ask device drivers to support L2 switching for
   partitions and something *different* for containers?

2) Do we really want any single packet to traverse an L2 switch (for
   the partition-style virtualization layer) and then an L3 switch
   (for the container-style layer)?

The full virtualization solution calls for virtual NICs with distinct
MAC addresses. Is there any reason why this same solution cannot work
for containers (just creating more than one VNIC for the partition, 
and then assigning each VNIC to a container?)


IMHO, I think there is one reason. The unsharing mechanism is not only 
for containers; it aims at other kinds of isolation, like a bsdjail for 
example. The unshare syscall is flexible; should the network unsharing be 
a one-block solution? For example, we want to launch an application using 
TCP/IP and we want to have an IP address only used by the application, 
nothing more.

With a layer 2, after unsharing we must:
 1) create a virtual device into the application namespace
 2) assign an IP address
 3) create a virtual device pass-through in the root namespace
 4) set the virtual device IP

All this stuff needs a lot of administration (checking MAC address 
conflicts, checking interface name collisions in the root namespace, ...) for a 
simple network isolation.


With a layer 3:
 1) assign an IP address (see the small example below)
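
To make that single step concrete, a minimal self-contained example using the 
standard SIOCSIFADDR ioctl; the interface name and address are placeholders:

#include <arpa/inet.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        struct ifreq ifr;
        struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        sin->sin_family = AF_INET;
        inet_pton(AF_INET, "10.0.0.2", &sin->sin_addr);

        /* assigning the address still needs CAP_NET_ADMIN in the root namespace */
        if (ioctl(fd, SIOCSIFADDR, &ifr) == -1)
                perror("SIOCSIFADDR");

        close(fd);
        return 0;
}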

On the other hand, a layer 3 isolation is not sufficient to reach the 
level of isolation/virtualization needed for the system containers.


Very soon, I will commit more info at:

http://wiki.openvz.org/Containers/Networking

So the consensus is based on the fact that there is a lot of common code 
for the layer 2 and layer 3 isolation/virtualization, and we can find a 
way to merge the two implementations in order to have a flexible network 
virtualization/isolation.


  -- Regards

Daniel.




Re: [Devel] Re: [RFC] network namespaces

2006-09-07 Thread Kirill Korotaev

Herbert Poetzl wrote:


my point (until we have an implementation which clearly
shows that performance is equal/better to isolation)
is simply this:

of course, you can 'simulate' or 'construct' all the
isolation scenarios with kernel bridging and routing
and tricky injection/marking of packets, but, this
usually comes with an overhead ...
 


Well, TANSTAAFL*, and pretty much everything comes with an overhead. 
Multitasking comes with the (scheduler, context switch, CPU cache, etc.) 
overhead -- is that the reason to abandon it? OpenVZ and Linux-VServer 
resource management also adds some overhead -- do we want to throw it away?


The question is not just equal or better performance, the question is 
what do we get and how much we pay for it.



Equal or better performance is certainly required when we have the code
compiled in but aren't using it.  We must not penalize the current code.

You are talking about host system performance.
Neither approach introduces overhead to host networking.

Finally, as I understand both network isolation and network 
virtualization (both level2 and level3) can happily co-exist. We do have 
several filesystems in kernel. Let's have several network virtualization 
approaches, and let the user choose. Does that make sense?



If there are no compelling arguments for using both ways of doing
it, it is silly to merge both, as it is more maintenance overhead.

That said I think there is a real chance if we can look at the bind
filtering and find a way to express that in the networking stack
through iptables.  Using the security hooks conflicts with things
like selinux.   Although it would be interesting to see if selinux
can already implement general purpose layer 3 filtering.

The more I look the gut feel I have is that the way to proceed would
be to add a new table that filters binds, and connects.  Plus a new
module that would look at a process creating a socket and tell us if
it is the appropriate group of processes.  With a little care that
would be a general solution to the layer 3 filtering problem.

Huh, you will still have to insert lots of access checks into different
parts of the code, like RAW sockets, netlink, protocols which are not inserted,
netfilter (so as not to allow creating iptables rules :) ), and many, many other places.

I see Dave Miller looking at such a patch and my ears hear his rude words :)

Thanks,
Kirill


Re: [Devel] Re: [RFC] network namespaces

2006-09-07 Thread Herbert Poetzl
On Thu, Sep 07, 2006 at 08:23:53PM +0400, Kirill Korotaev wrote:
 Herbert Poetzl wrote:
 
 my point (until we have an implementation which clearly
 shows that performance is equal/better to isolation)
 is simply this:
 
  of course, you can 'simulate' or 'construct' all the
  isolation scenarios with kernel bridging and routing
  and tricky injection/marking of packets, but, this
  usually comes with an overhead ...
   
 
 Well, TANSTAAFL*, and pretty much everything comes with an overhead. 
 Multitasking comes with the (scheduler, context switch, CPU cache, etc.) 
 overhead -- is that the reason to abandon it? OpenVZ and Linux-VServer 
 resource management also adds some overhead -- do we want to throw it away?
 
 The question is not just equal or better performance, the question is 
 what do we get and how much we pay for it.
  
  
  Equal or better performance is certainly required when we have the code
  compiled in but aren't using it.  We must not penalize the current code.
 you talk about host system performance.
 Both approaches do not introduce overhead to host networking.
 
 Finally, as I understand both network isolation and network 
 virtualization (both level2 and level3) can happily co-exist. We do have 
 several filesystems in kernel. Let's have several network virtualization 
 approaches, and let the user choose. Does that make sense?
  
  
  If there are no compelling arguments for using both ways of doing
  it, it is silly to merge both, as it is more maintenance overhead.
  
  That said I think there is a real chance if we can look at the bind
  filtering and find a way to express that in the networking stack
  through iptables.  Using the security hooks conflicts with things
  like selinux.   Although it would be interesting to see if selinux
  can already implement general purpose layer 3 filtering.
  
  The more I look the gut feel I have is that the way to proceed would
  be to add a new table that filters binds, and connects.  Plus a new
  module that would look at a process creating a socket and tell us if
  it is the appropriate group of processes.  With a little care that
  would be a general solution to the layer 3 filtering problem.

 Huh, you will still have to insert lots of access checks into
 different parts of code like RAW sockets, netlinks, protocols which
 are not inserted, netfilters (to not allow create iptables rules :) )
 and many many other places.

well, who said that you need to have things like RAW sockets
or other protocols except IP, not to speak of iptables and 
routing entries ...

folks who _want_ full network virtualization can use the
more complete virtual setup and be happy ...

best,
Herbert

 I see Dave Miller looking at such a patch and my ears hear his rude
 words :)
 
 Thanks,
 Kirill


Re: [RFC] network namespaces

2006-09-07 Thread Eric W. Biederman
Daniel Lezcano [EMAIL PROTECTED] writes:

 IMHO, I think there is one reason. The unsharing mechanism is not only for
 containers; it aims at other kinds of isolation, like a bsdjail for example. The
 unshare syscall is flexible; should the network unsharing be a one-block 
 solution?
 For example, we want to launch an application using TCP/IP and we want to have
 an IP address only used by the application, nothing more.
 With a layer 2, we must after unsharing:
  1) create a virtual device into the application namespace
  2) assign an IP address
  3) create a virtual device pass-through in the root namespace
  4) set the virtual device IP

 All this stuff needs a lot of administration (checking MAC address conflicts,
 checking interface name collisions in the root namespace, ...) for a simple network
 isolation.

Yes, and even more it is hard to show that it will perform as well.
Although by dropping CAP_NET_ADMIN the actual runtime administration
is about the same.

 With a layer 3:
  1) assign an IP address

 On the other hand, a layer 3 isolation is not sufficient to reach the level of
 isolation/virtualization needed for the system containers.

Agreed.

 Very soon, I will commit more info at:

 http://wiki.openvz.org/Containers/Networking

 So the consensus is based on the fact that there is a lot of common code for
 the layer 2 and layer 3 isolation/virtualization, and we can find a way to merge
 the two implementations in order to have a flexible network
 virtualization/isolation.

NACK.  In a real level 3 implementation there is very little common code with
a layer 2 implementation.  You don't need to muck with the socket handling
code, as you are not allowed to dup addresses between containers.  Look
at what Serge did; that is layer 3.

A layer 3 isolation implementation should either be a new security module
or a new form of iptables.  The problem with using the LSM is that it
seems to be an all-or-nothing mechanism, so it is a very coarse-grained
tool for this job.

A layer 2 implementation (where you have network devices isolated and not 
sockets)
should be a namespace.

Eric


Re: [Devel] Re: [RFC] network namespaces

2006-09-07 Thread Eric W. Biederman
Herbert Poetzl [EMAIL PROTECTED] writes:

 On Thu, Sep 07, 2006 at 08:23:53PM +0400, Kirill Korotaev wrote:

 well, who said that you need to have things like RAW sockets
 or other protocols except IP, not to speak of iptable and 
 routing entries ...

 folks who _want_ full network virtualization can use the
 more complete virtual setup and be happy ...

Exactly. This was a proposal for isolation for containers
that don't get CAP_NET_ADMIN, with a facility that could
easily be general purpose.

Eric


Re: [RFC] network namespaces

2006-09-06 Thread Daniel Lezcano

Hi Herbert,


well, the 'ip subset' approach Linux-VServer and
other Jail solutions use is very clean, it just does
not match your expectations of a virtual interface
(as there is none) and it does not cope well with
all kinds of per context 'requirements', which IMHO
do not really exist on the application layer (only
on the whole system layer)

IMHO that would be quite simple, have a 'namespace'
for limiting port binds to a subset of the available
ips and another one which does complete network 
virtualization with all the whistles and bells, IMHO

most of them are orthogonal and can easily be combined

 - full network virtualization
 - lightweight ip subset 
 - both


IMHO this requirement only arises from the full system
virtualization approach, just look at the other jail
solutions (solaris, bsd, ...) some of them do not even 
allow for more than a single ip but they work quite

well when used properly ...


As far as I can see, VServer uses a layer 3 solution but, when needed, the 
veth component, made by Nestor Pena, is used to provide a layer 2 
virtualization. Right?


Having the two solutions, you certainly have a lot of information about 
use cases. From the point of view of VServer, can you give some examples 
of when a layer 3 solution is better/worse than a layer 2 solution? Who 
wants a layer 2/3 virtualization and why?


This information will be very useful.

Regards

  -- Daniel


Re: [RFC] network namespaces

2006-09-06 Thread Kirill Korotaev

On Tue, Sep 05, 2006 at 08:45:39AM -0600, Eric W. Biederman wrote:


Daniel Lezcano [EMAIL PROTECTED] writes:

For HPC, if you are interested in migration you need a separate IP
per container. If you can take your IP address with you, migration of
networking state is simple. If you can't take your IP address with you,
a network container is nearly pointless from a migration perspective.

Beyond that from everything I have seen layer 2 is just much cleaner
than any layer 3 approach short of Serge's bind filtering.


well, the 'ip subset' approach Linux-VServer and
other Jail solutions use is very clean, it just does
not match your expectations of a virtual interface
(as there is none) and it does not cope well with
all kinds of per context 'requirements', which IMHO
do not really exist on the application layer (only
on the whole system layer)



I probably expressed that wrong.  There are currently three
basic approaches under discussion.
Layer 3 (Basically bind filtering) nothing at the packet level.
   The approach taken by Serge's version of bsdjails and Vserver.

Layer 2.5 What Daniel proposed.

Layer 2.  (Trivially mapping each packet to a different interface)
   And then treating everything as multiple instances of the
   network stack.
Roughly what OpenVZ and I have implemented.

I think classifying network virtualization by Layer X is not good enough.
OpenVZ has Layer 3 (venet) and Layer 2 (veth) implementations, but
in both cases the networking stack inside a VE remains fully virtualized.

Thanks,
Kirill



Re: [Devel] Re: [RFC] network namespaces

2006-09-06 Thread Kir Kolyshkin

Kirill Korotaev wrote:

I think classifying network virtualization by Layer X is not good enough.
OpenVZ has Layer 3 (venet) and Layer 2 (veth) implementations, but
in both cases networking stack inside VE remains fully virtualized.
  
Let's describe all those (three?) approaches at 
http://wiki.openvz.org/Containers/Networking
Everyone is able to read and contribute to it, and (I hope) we will 
come to a common understanding. I have started the article; please 
expand it.



Re: [RFC] network namespaces

2006-09-06 Thread Herbert Poetzl
On Wed, Sep 06, 2006 at 11:10:23AM +0200, Daniel Lezcano wrote:
 Hi Herbert,
 
 well, the 'ip subset' approach Linux-VServer and
 other Jail solutions use is very clean, it just does
 not match your expectations of a virtual interface
 (as there is none) and it does not cope well with
 all kinds of per context 'requirements', which IMHO
 do not really exist on the application layer (only
 on the whole system layer)
 
 IMHO that would be quite simple, have a 'namespace'
 for limiting port binds to a subset of the available
 ips and another one which does complete network 
 virtualization with all the whistles and bells, IMHO
 most of them are orthogonal and can easily be combined
 
  - full network virtualization
  - lightweight ip subset 
  - both
 
 IMHO this requirement only arises from the full system
 virtualization approach, just look at the other jail
 solutions (solaris, bsd, ...) some of them do not even 
 allow for more than a single ip but they work quite
 well when used properly ...
 
 As far as I see, vserver use a layer 3 solution but, when needed, the
 veth component, made by Nestor Pena, is used to provide a layer 2
 virtualization. Right ?

well, no, we do not explicitly use the VETH daemon
for networking, although some folks probably make use
of it, mainly because if you realize that this kind 
of isolation is something different from and partially
complementary to network virtualization, you can
live without the layer 2 virtualization in almost
all cases; nevertheless, for certain purposes layer
2/3 virtualization is required and/or makes perfect
sense

 Having the two solutions, you certainly have a lot of information
 about use cases. 

 From the point of view of VServer, can you give some
 examples of when a layer 3 solution is better/worse than 
 a layer 2 solution? 

my point (until we have an implementation which clearly
shows that performance is equal/better to isolation)
is simply this:

 of course, you can 'simulate' or 'construct' all the
 isolation scenarios with kernel bridging and routing
 and tricky injection/marking of packets, but, this
 usually comes with an overhead ...

 Who wants a layer 2/3 virtualization and why ?

there are some reasons for virtualization instead of
pure isolation (as Linux-VServer does it for now)

 - context migration/snapshot (probably reason #1)
 - creating network devices inside a guest
   (can help with vpn and similar)
 - allowing non IP protocols (like DHCP, ICMP, etc)

the problem which arises with this kind of network
virtualization is that you need some additional policy
for example to avoid sending 'evil' packets and/or
(D)DoSing one guest from another, which again adds
further overhead, so basically if you 'just' want
to have network isolation, you have to do this:

 - create a 'copy' of your hosts networking inside
   the guest (with virtual interfaces)
 - assign all the same (subset) ips and this to
   the virtual guest interfaces
 - activate some smart bridging code which 'knows'
   what ports can be used and/or mapped 
 - add policy to block unwanted connections and/or
   packets to/from the guest

all this sounds very intrusive and for sure (please
 prove me wrong here :) adds a lot of overhead to the
networking itself, while a 'simple' isolation approach
for IP (tcp/udp) is (almost) without any cost, certainly
without overhead once a connection is established.

 These informations will be very useful.

HTH,
Herbert

 Regards
 
   -- Daniel


Re: [Devel] Re: [RFC] network namespaces

2006-09-06 Thread Kir Kolyshkin

Herbert Poetzl wrote:

my point (until we have an implementation which clearly
shows that performance is equal/better to isolation)
is simply this:

 of course, you can 'simulate' or 'construct' all the
 isolation scenarios with kernel bridging and routing
 and tricky injection/marking of packets, but, this
 usually comes with an overhead ...
  
Well, TANSTAAFL*, and pretty much everything comes with an overhead. 
Multitasking comes with the (scheduler, context switch, CPU cache, etc.) 
overhead -- is that the reason to abandon it? OpenVZ and Linux-VServer 
resource management also adds some overhead -- do we want to throw it away?


The question is not just equal or better performance, the question is 
what do we get and how much we pay for it.


Finally, as I understand it, both network isolation and network 
virtualization (both level 2 and level 3) can happily co-exist. We do have 
several filesystems in the kernel. Let's have several network virtualization 
approaches, and let the user choose. Does that make sense?



* -- http://en.wikipedia.org/wiki/TANSTAAFL


Re: [RFC] network namespaces

2006-09-06 Thread Eric W. Biederman
Herbert Poetzl [EMAIL PROTECTED] writes:

 On Wed, Sep 06, 2006 at 11:10:23AM +0200, Daniel Lezcano wrote:
 
 As far as I see, vserver use a layer 3 solution but, when needed, the
 veth component, made by Nestor Pena, is used to provide a layer 2
 virtualization. Right ?

 well, no, we do not explicitly use the VETH daemon
 for networking, although some folks probably make use
 of it, mainly because if you realize that this kind 
 of isolation is something different and partially
 complementary to network virtualization, you can do
 live without the layer 2 virtualization in almost
 all cases, nevertheless, for certain purposes layer
 2/3 virtualization is required and/or makes perfect
 sense

 Having the two solutions, you certainly have a lot of information
 about use cases. 

 From the point of view of vserver, can you give some
 examples of when a layer 3 solution is better/worst than 
 a layer 2 solution ? 

 my point (until we have an implementation which clearly
 shows that performance is equal/better to isolation)
 is simply this:

  of course, you can 'simulate' or 'construct' all the
  isolation scenarios with kernel bridging and routing
  and tricky injection/marking of packets, but, this
  usually comes with an overhead ...

 Who wants a layer 2/3 virtualization and why ?

 there are some reasons for virtualization instead of
 pure isolation (as Linux-VServer does it for now)

  - context migration/snapshot (probably reason #1)
  - creating network devices inside a guest
(can help with vpn and similar)
  - allowing non IP protocols (like DHCP, ICMP, etc)

 the problem which arises with this kind of network
 virtualization is that you need some additional policy
 for example to avoid sending 'evil' packets and/or
 (D)DoSing one guest from another, which again adds
 further overhead, so basically if you 'just' want
 to have network isolation, you have to do this:

  - create a 'copy' of your hosts networking inside
the guest (with virtual interfaces)
  - assign all the same (subset) ips and this to
the virtual guest interfaces
  - activate some smart bridging code which 'knows'
what ports can be used and/or mapped 
  - add policy to block unwanted connections and/or
packets to/from the guest

 all this sounds very intrusive and for sure (please
 prove me wrong here :) adds a lot of overhead to the
 networking itself, while a 'simple' isolation approach
 for IP (tcp/udp) is (almost) without any cost, certainly
 without overhead once a connection is established.

Thanks for the good summary of the situation.

I think we can prove you wrong but it is going to take
some doing to build a good implementation and take
the necessary measurements.

Hmm.  I wonder if the filtering, layer 3 style of isolation can be built with
netfilter rules.  Just skimming, it looks like we may be able to do it with something
like the netfilter owner module, possibly in conjunction with the connmark or
conntrack modules.  If not, if the infrastructure is close enough, we can write
our own module.

Has anyone looked at network isolation from the netfilter perspective?

Eric



Re: [RFC] network namespaces

2006-09-06 Thread Eric W. Biederman
Kir Kolyshkin [EMAIL PROTECTED] writes:

 Herbert Poetzl wrote:
 my point (until we have an implementation which clearly
 shows that performance is equal/better to isolation)
 is simply this:

  of course, you can 'simulate' or 'construct' all the
  isolation scenarios with kernel bridging and routing
  and tricky injection/marking of packets, but, this
  usually comes with an overhead ...
   
 Well, TANSTAAFL*, and pretty much everything comes with an overhead. 
 Multitasking comes with the (scheduler, context switch, CPU cache, etc.) 
 overhead -- is that the reason to abandon it? OpenVZ and Linux-VServer 
 resource management also adds some overhead -- do we want to throw it away?

 The question is not just equal or better performance, the question is 
 what do we get and how much we pay for it.

Equal or better performance is certainly required when we have the code
compiled in but aren't using it.  We must not penalize the current code.

 Finally, as I understand both network isolation and network 
 virtualization (both level2 and level3) can happily co-exist. We do have 
 several filesystems in kernel. Let's have several network virtualization 
 approaches, and let the user choose. Does that make sense?

If there are no compelling arguments for using both ways of doing
it, it is silly to merge both, as it is more maintenance overhead.

That said I think there is a real chance if we can look at the bind
filtering and find a way to express that in the networking stack
through iptables.  Using the security hooks conflicts with things
like selinux.   Although it would be interesting to see if selinux
can already implement general purpose layer 3 filtering.

The more I look the gut feel I have is that the way to proceed would
be to add a new table that filters binds, and connects.  Plus a new
module that would look at a process creating a socket and tell us if
it is the appropriate group of processes.  With a little care that
would be a general solution to the layer 3 filtering problem.

Eric


Re: [RFC] network namespaces

2006-09-06 Thread Kir Kolyshkin

Eric W. Biederman wrote:

Kir Kolyshkin [EMAIL PROTECTED] writes:

  

Herbert Poetzl wrote:


my point (until we have an implementation which clearly
shows that performance is equal/better to isolation)
is simply this:

 of course, you can 'simulate' or 'construct' all the
 isolation scenarios with kernel bridging and routing
 and tricky injection/marking of packets, but, this
 usually comes with an overhead ...
  
  
Well, TANSTAAFL*, and pretty much everything comes with an overhead. 
Multitasking comes with the (scheduler, context switch, CPU cache, etc.) 
overhead -- is that the reason to abandon it? OpenVZ and Linux-VServer 
resource management also adds some overhead -- do we want to throw it away?


The question is not just equal or better performance, the question is 
what do we get and how much we pay for it.



Equal or better performance is certainly required when we have the code
compiled in but aren't using it.  We must not penalize the current code.
  
That's a valid argument. Although it's not applicable here (at least for 
both network virtualization types which OpenVZ offers). Kirill/Andrey, 
please correct me if I'm wrong here.
Finally, as I understand both network isolation and network 
virtualization (both level2 and level3) can happily co-exist. We do have 
several filesystems in kernel. Let's have several network virtualization 
approaches, and let the user choose. Does that make sense?


If there are no compelling arguments for using both ways of doing
it, it is silly to merge both, as it is more maintenance overhead.
  

Definitely a valid argument as well.

I am not sure about network isolation (used by Linux-VServer), but when 
it comes to level 2 vs. level 3 virtualization, I see a need for both. 
Here is an easy-to-understand comparison which can shed some light: 
http://wiki.openvz.org/Differences_between_venet_and_veth


Here are a couple of examples:
* Do we want to let the container's owner (i.e. root) add/remove IP 
addresses? Most probably not, but in some cases we want that.
* Do we want to be able to run a DHCP server and/or DHCP client inside a 
container? Sometimes... but not always.
* Do we want to let the container's owner create/manage his own set of 
iptables rules? In half of the cases we do.


The problem here is that a single solution will not cover all those scenarios.

That said I think there is a real chance if we can look at the bind
filtering and find a way to express that in the networking stack
through iptables.  Using the security hooks conflicts with things
like selinux.   Although it would be interesting to see if selinux
can already implement general purpose layer 3 filtering.

The more I look the gut feel I have is that the way to proceed would
be to add a new table that filters binds, and connects.  Plus a new
module that would look at a process creating a socket and tell us if
it is the appropriate group of processes.  With a little care that
would be a general solution to the layer 3 filtering problem.

Eric
  




Re: [RFC] network namespaces

2006-09-06 Thread Cedric Le Goater
Eric W. Biederman wrote:
 This family of containers is also used for HPC (high performance computing) and
 for distributed checkpoint/restart. The cluster runs hundreds of jobs, spawning
 them on different hosts inside an application container. Usually the jobs
 communicate with broadcast and multicast.
 Application containers do not care about having different MAC addresses and rely on
 a layer 3 approach.
 
 Ok I think to understand this we need some precise definitions.
 In the normal case it is an error for a job to communication with a different
 job.  

hmm ? What about an MPI application ?

I would expect each MPI task to be run in its own container, on different nodes
or on the same node. These individual tasks _communicate_ with each
other through the MPI layer (not only TCP btw) to complete a large calculation.

 The basic advantage with a different MAC is that you can found out who the
 intended recipient is sooner in the networking stack and you have truly
 separate network devices.  Allowing for a cleaner implementation.
 
 Changing the MAC after migration is likely to be fine.

indeed.

C.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-09-06 Thread Eric W. Biederman
Cedric Le Goater [EMAIL PROTECTED] writes:

 Eric W. Biederman wrote:

 hmm ? What about an MPI application ?

 I would expect each MPI task to be run in its container on different nodes
 or on the same node. These individual tasks _communicate_ between each
 other through the MPI layer (not only TCP btw) to complete a large 
 calculation.

All parts of the MPI application are part of the same job.  Communication between
processes on multiple machines that are part of the job is fine.

At least that is how I have used the term job in an HPC context.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-09-06 Thread Cedric Le Goater
Kir Kolyshkin wrote:

snip

 I am not sure about network isolation (used by Linux-VServer), but as 
 it comes for level2 vs. level3 virtualization, I see a need for both. 
 Here is the easy-to-understand comparison which can shed some light: 
 http://wiki.openvz.org/Differences_between_venet_and_veth

thanks kir,

 Here are a couple of examples
 * Do we want to let container's owner (i.e. root) to add/remove IP 
 addresses? Most probably not, but in some cases we want that.
 * Do we want to be able to run DHCP server and/or DHCP client inside a 
 container? Sometimes...but not always.
 * Do we want to let container's owner to create/manage his own set of 
 iptables? In half of the cases we do.
 
 The problem here is single solution will not cover all those scenarios.

some would argue that there is one single solution: Xen or similar.

IMO, containers should try to leverage their main difference,
performance, and not try to simulate a real hardware environment.

Restricting the network environment of a container should be considered
acceptable if this is for the sake of performance. The network interface(s)
could be pre-configured and provided to the container. Protocol(s) could be
forbidden.

Now, if you need more network power in a container, you will need a real or
a virtualized interface.

But let's consider both alternatives.

thanks,

C.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: [RFC] network namespaces

2006-09-06 Thread Daniel Lezcano

Kir Kolyshkin wrote:

Herbert Poetzl wrote:


my point (until we have an implementation which clearly
shows that performance is equal/better to isolation)
is simply this:

 of course, you can 'simulate' or 'construct' all the
 isolation scenarios with kernel bridging and routing
 and tricky injection/marking of packets, but, this
 usually comes with an overhead ...
  


Well, TANSTAAFL*, and pretty much everything comes with an overhead. 
Multitasking comes with the (scheduler, context switch, CPU cache, etc.) 
overhead -- is that the reason to abandon it? OpenVZ and Linux-VServer 
resource management also adds some overhead -- do we want to throw it away?


The question is not just equal or better performance, the question is 
what do we get and how much we pay for it.


Finally, as I understand both network isolation and network 
virtualization (both level2 and level3) can happily co-exist. We do have 
several filesystems in kernel. Let's have several network virtualization 
approaches, and let a user choose. Is that makes sense?


Definitely yes, I agree.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC] network namespaces

2006-09-06 Thread Caitlin Bestler
[EMAIL PROTECTED] wrote:
 
 
 Finally, as I understand both network isolation and network
 virtualization (both level2 and level3) can happily co-exist. We do
 have several filesystems in kernel. Let's have several network
 virtualization approaches, and let a user choose. Is that makes
 sense? 
 
 If there are not compelling arguments for using both ways of
 doing it is silly to merge both, as it is more maintenance overhead.
 

My reading is that full virtualization (Xen, etc.) calls for
implementing
L2 switching between the partitions and the physical NIC(s).

The tradeoffs between L2 and L3 switching are indeed complex, but
there are two implications of doing L2 switching between partitions:

1) Do we really want to ask device drivers to support L2 switching for
   partitions and something *different* for containers?

2) Do we really want any single packet to traverse an L2 switch (for
   the partition-style virtualization layer) and then an L3 switch
   (for the container-style layer)?

The full virtualization solution calls for virtual NICs with distinct
MAC addresses. Is there any reason why this same solution cannot work
for containers (just creating more than one VNIC for the partition, 
and then assigning each VNIC to a container)?

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-09-06 Thread Eric W. Biederman
Caitlin Bestler [EMAIL PROTECTED] writes:

 [EMAIL PROTECTED] wrote:
  
 
 Finally, as I understand both network isolation and network
 virtualization (both level2 and level3) can happily co-exist. We do
 have several filesystems in kernel. Let's have several network
 virtualization approaches, and let a user choose. Is that makes
 sense? 
 
 If there are not compelling arguments for using both ways of
 doing it is silly to merge both, as it is more maintenance overhead.
 

 My reading is that full virtualization (Xen, etc.) calls for
 implementing
 L2 switching between the partitions and the physical NIC(s).

 The tradeoffs between L2 and L3 switching are indeed complex, but
 there are two implications of doing L2 switching between partitions:

 1) Do we really want to ask device drivers to support L2 switching for
partitions and something *different* for containers?

No.

 2) Do we really want any single packet to traverse an L2 switch (for
the partition-style virtualization layer) and then an L3 switch
(for the container-style layer)?

In general what has been done with layer 3 is to simply filter which
processes can use which IP addresses and it all happens at socket
creation time.  So it is very cheap, and it can be done purely
in the network layer without any driver intervention.

Basically think of what is happening at layer 3 as an extremely light-weight
version of traffic filtering.
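
For illustration only (struct net_ctx and its fields are assumptions made
up for this sketch, not code from any of the posted patches), such a
socket-creation-time check boils down to a lookup against the short list
of addresses the calling context owns, with nothing done per packet:

/*
 * Sketch only: the layer 3 ("bind filtering") check.  A context owns a
 * short list of IPv4 addresses and bind() is checked against it once;
 * no per-packet work is involved.
 */
#include <linux/types.h>
#include <linux/errno.h>

struct net_ctx {
    __be32  *addrs;     /* addresses this context may use */
    int     nr_addrs;   /* 0 means unrestricted (root context) */
};

static int ctx_check_bind(const struct net_ctx *ctx, __be32 addr)
{
    int i;

    if (!ctx || ctx->nr_addrs == 0)
        return 0;               /* root context: no filtering */
    if (addr == 0)
        return 0;               /* INADDR_ANY is remapped, not rejected */
    for (i = 0; i < ctx->nr_addrs; i++)
        if (ctx->addrs[i] == addr)
            return 0;
    return -EADDRNOTAVAIL;      /* not one of this context's addresses */
}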

 The full virtualization solution calls for virtual NICs with distinct
 MAC addresses. Is there any reason why this same solution cannot work
 for containers (just creating more than one VNIC for the partition, 
 and then assigning each VNIC to a container?)

The VNIC approach is the fundamental idea with the layer two networking,
and if we can push the work down into the device driver so that different
destination MACs show up in different packet queues, it should be
as fast as a normal networking stack.

Implementing VNICs so far is the only piece of containers that has
come close to device drivers, and we can likely do it without device
driver support (but with more cost).  Basically this optimization
is a subset of the Grand Unified Lookup idea.
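
A hedged sketch of that demultiplexing step (struct vnic, vnic_list and
the hook point are illustrative assumptions, not the posted code): the
receive path of the physical device maps the destination MAC to the
virtual device, and hence to the namespace, before the stack sees the
packet:

/*
 * Sketch: map destination MAC -> virtual NIC owned by a namespace.
 */
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/if_ether.h>
#include <linux/skbuff.h>

struct vnic {
    unsigned char      mac[ETH_ALEN];
    struct net_device  *dev;    /* device visible in the guest namespace */
    struct vnic        *next;
};

static struct vnic *vnic_list;

/* Called early on the receive path of the physical device. */
static struct net_device *vnic_demux(struct sk_buff *skb)
{
    struct ethhdr *eth = eth_hdr(skb);
    struct vnic *v;

    if (is_multicast_ether_addr(eth->h_dest))
        return skb->dev;        /* broadcast/multicast: default path */

    for (v = vnic_list; v; v = v->next)
        if (!compare_ether_addr(eth->h_dest, v->mac))
            return v->dev;      /* deliver into that namespace */

    return skb->dev;            /* unknown MAC: keep the root path */
}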

I think we can do a mergeable implementation with no noticeable cost
when not using containers, without having to resort to a grand unified lookup,
but I may be wrong.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-09-06 Thread Stephen Hemminger
On Wed, 06 Sep 2006 17:25:50 -0600
[EMAIL PROTECTED] (Eric W. Biederman) wrote:

 Caitlin Bestler [EMAIL PROTECTED] writes:
 
  [EMAIL PROTECTED] wrote:
   
  
  Finally, as I understand both network isolation and network
  virtualization (both level2 and level3) can happily co-exist. We do
  have several filesystems in kernel. Let's have several network
  virtualization approaches, and let a user choose. Is that makes
  sense? 
  
  If there are not compelling arguments for using both ways of
  doing it is silly to merge both, as it is more maintenance overhead.
  
 
  My reading is that full virtualization (Xen, etc.) calls for
  implementing
  L2 switching between the partitions and the physical NIC(s).
 
  The tradeoffs between L2 and L3 switching are indeed complex, but
  there are two implications of doing L2 switching between partitions:
 
  1) Do we really want to ask device drivers to support L2 switching for
 partitions and something *different* for containers?
 
 No.
 
  2) Do we really want any single packet to traverse an L2 switch (for
 the partition-style virtualization layer) and then an L3 switch
 (for the container-style layer)?
 
 In general what has been done with layer 3 is to simply filter which
 processes can use which IP addresses and it all happens at socket
 creation time.  So it is very cheap, and it can be done purely
 in the network layer without any driver intervention.
 
 Basically think of what is happening at layer 3 as an extremely light-weight
 version of traffic filtering.
 
  The full virtualization solution calls for virtual NICs with distinct
  MAC addresses. Is there any reason why this same solution cannot work
  for containers (just creating more than one VNIC for the partition, 
  and then assigning each VNIC to a container?)
 
 The VNIC approach is the fundamental idea with the layer two networking,
 and if we can push the work down into the device driver so that different
 destination MACs show up in different packet queues, it should be
 as fast as a normal networking stack.
 
 Implementing VNICs so far is the only piece of containers that has
 come close to device drivers, and we can likely do it without device
 driver support (but with more cost).  Basically this optimization
 is a subset of the Grand Unified Lookup idea.
 
 I think we can do a mergeable implementation with no noticeable cost
 when not using containers, without having to resort to a grand unified lookup,
 but I may be wrong.
 
 Eric

The problem with VNICs is that they won't work for all devices (without lots of
work), and for many devices it requires putting the device in promiscuous
mode. It also plays havoc with network access control devices.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-09-06 Thread Eric W. Biederman
Stephen Hemminger [EMAIL PROTECTED] writes:

 The problem with VNIC's is it won't work for all devices (without lots of
 work), and for many device's it requires putting the device in promiscuous
 mode. It also plays havoc with network access control devices.

Which is fine.  If it works it is a cool performance optimization.
But it doesn't stop anything from my side if it doesn't.

Eric

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-09-05 Thread Daniel Lezcano

Hi all,


  This complete separation of namespaces is very useful for at least two
  purposes:
   - allowing users to create and manage by their own various tunnels and
 VPNs, and
   - enabling easier and more straightforward live migration of groups of
 processes with their environment.



I conceptually prefer this approach, but I seem to recall there were
actual problems in using this for checkpoint/restart of lightweight
(application) containers.  Performance aside, are there any reasons why
this approach would be problematic for c/r?


I agree with this approach too; separated namespaces are the best way to 
identify the network resources for a specific container.



I'm afraid Daniel may be on vacation, and don't know who else other than
Eric might have thoughts on this.


Yes, I was on vacation, but I am back :)


2. People expressed concerns that complete separation of namespaces
  may introduce an undesired overhead in certain usage scenarios.
  The overhead comes from packets traversing input path, then output path,
  then input path again in the destination namespace if root namespace
  acts as a router.


Yes, performance is probably one issue.

My concern was about layer 2 / layer 3 virtualization. I agree that a layer 2 
isolation/virtualization is the best fit for the system container.
But there is another family of container called the application container: 
it is not a whole system which is run inside the container but only the 
application. If you want to run an Oracle database inside a container, 
you can run it inside an application container without launching init 
and all the services.


This family of containers is also used for HPC (high performance 
computing) and for distributed checkpoint/restart. The cluster runs 
hundreds of jobs, spawning them on different hosts inside an application 
container. Usually the jobs communicate with broadcast and multicast.
Application containers do not care about having different MAC addresses and 
rely on a layer 3 approach.


Are application containers comfortable with a layer 2 virtualization? I 
don't think so, because several jobs running inside the same host 
communicate via broadcast/multicast among themselves and with other jobs 
running on different hosts. The IP consumption is a problem too: 1 
container == 2 IPs (one for the root namespace, one for the container), 
multiplied by the number of jobs. Furthermore, lots of jobs == lots 
of virtual devices.


However, after a discussion with Kirill at the OLS, it appears we can 
merge the layer 2 and 3 approaches if the level of network 
virtualization is tunable and we can choose layer 2 or layer 3 when 
doing the unshare. The determination of the namespace for the incoming 
traffic can be done with a specific iptables module as a first step. 
While looking at the network namespace patches, it appears that the 
TCP/UDP part is **very** similar to what is needed for a layer 3 approach.


Any thoughts ?

Daniel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-09-05 Thread Eric W. Biederman
Daniel Lezcano [EMAIL PROTECTED] writes:

2. People expressed concerns that complete separation of namespaces
   may introduce an undesired overhead in certain usage scenarios.
   The overhead comes from packets traversing input path, then output path,
   then input path again in the destination namespace if root namespace
   acts as a router.

 Yes, performance is probably one issue.

 My concerns was for layer 2 / layer 3 virtualization. I agree a layer 2
 isolation/virtualization is the best for the system container.
 But there is another family of container called application container, it is
 not a system which is run inside a container but only the application. If you
 want to run a oracle database inside a container, you can run it inside an
 application container without launching init and all the services.

 This family of containers are used too for HPC (high performance computing) 
 and
 for distributed checkpoint/restart. The cluster runs hundred of jobs, spawning
 them on different hosts inside an application container. Usually the jobs
 communicates with broadcast and multicast.
 Application containers does not care of having different MAC address and rely 
 on
 a layer 3 approach.

 Are application containers comfortable with a layer 2 virtualization ? I don't
 think so, because several jobs running inside the same host communicate via
 broadcast/multicast between them and between other jobs running on different
 hosts. The IP consumption is a problem too: 1 container == 2 IP (one for the
 root namespace/ one for the container), multiplicated with the number of
 jobs. Furthermore, lot of jobs == lot of virtual devices.

 However, after a discussion with Kirill at the OLS, it appears we can merge 
 the
 layer 2 and 3 approaches if the level of network virtualization is tunable and
 we can choose layer 2 or layer 3 when doing the unshare. The determination 
 of
 the namespace for the incoming traffic can be done with an specific iptable
 module as a first step. While looking at the network namespace patches, it
 appears that the TCP/UDP part is **very** similar at what is needed for a 
 layer
 3 approach.

 Any thoughts ?

For HPC, if you are interested in migration you need a separate IP per
container.  If you can take your IP address with you, migration of
networking state is simple.  If you can't take your IP address with
you, a network container is nearly pointless from a migration
perspective.

Beyond that from everything I have seen layer 2 is just much cleaner
than any layer 3 approach short of Serge's bind filtering.

Beyond that I have yet to see a clean semantics for anything
resembling your layer 2 layer 3 hybrid approach.  If we can't have
clear semantics it is by definition impossible to implement correctly
because no one understands what it is supposed to do.

Note.  A true layer 3 approach has no impact on TCP/UDP filtering
because it filters at bind time not at packet reception time.  Once
you start inspecting packets I don't see what the gain is from not
going all of the way to layer 2.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-09-05 Thread Daniel Lezcano

For HPC if you are interested in migration you need a separate IP per
container.  If you can take you IP address with you migration of
networking state is simple.  If you can't take your IP address with
you a network container is nearly pointless from a migration
perspective.


Eric, please, I know... I showed you a migration demo at OLS ;)


Beyond that from everything I have seen layer 2 is just much cleaner
than any layer 3 approach short of Serge's bind filtering.



Beyond that I have yet to see a clean semantics for anything
resembling your layer 2 layer 3 hybrid approach.  If we can't have
clear semantics it is by definition impossible to implement correctly
because no one understands what it is supposed to do.



Note.  A true layer 3 approach has no impact on TCP/UDP filtering
because it filters at bind time not at packet reception time.  Once
you start inspecting packets I don't see what the gain is from not
going all of the way to layer 2.


The bsdjail was just for information ...


- Daniel

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-09-05 Thread Kirill Korotaev

Yes, performance is probably one issue.

My concerns was for layer 2 / layer 3 virtualization. I agree a layer 2 
isolation/virtualization is the best for the system container.
But there is another family of container called application container, 
it is not a system which is run inside a container but only the 
application. If you want to run a oracle database inside a container, 
you can run it inside an application container without launching init 
and all the services.


This family of containers are used too for HPC (high performance 
computing) and for distributed checkpoint/restart. The cluster runs 
hundred of jobs, spawning them on different hosts inside an application 
container. Usually the jobs communicates with broadcast and multicast.
Application containers does not care of having different MAC address and 
rely on a layer 3 approach.


Are application containers comfortable with a layer 2 virtualization ? I 
 don't think so, because several jobs running inside the same host 
communicate via broadcast/multicast between them and between other jobs 
running on different hosts. The IP consumption is a problem too: 1 
container == 2 IP (one for the root namespace/ one for the container), 
multiplicated with the number of jobs. Furthermore, lot of jobs == lot 
of virtual devices.


However, after a discussion with Kirill at the OLS, it appears we can 
merge the layer 2 and 3 approaches if the level of network 
virtualization is tunable and we can choose layer 2 or layer 3 when 
doing the unshare. The determination of the namespace for the incoming 
traffic can be done with an specific iptable module as a first step. 
While looking at the network namespace patches, it appears that the 
TCP/UDP part is **very** similar at what is needed for a layer 3 approach.


Any thoughts ?

My humble opinion is that your approach doesn't intersect with this one.
So we can freely go with both *if needed*,
and hear the comments from the network guru guys on what and how to improve.

So I suggest you at least send the patches, so we can discuss them.

Thanks,
Kirill
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-09-05 Thread Herbert Poetzl
On Tue, Sep 05, 2006 at 08:45:39AM -0600, Eric W. Biederman wrote:
 Daniel Lezcano [EMAIL PROTECTED] writes:
 
 2. People expressed concerns that complete separation of namespaces
may introduce an undesired overhead in certain usage scenarios.
The overhead comes from packets traversing input path, then output path,
then input path again in the destination namespace if root namespace
acts as a router.
 
  Yes, performance is probably one issue.
 
  My concerns was for layer 2 / layer 3 virtualization. I agree
  a layer 2 isolation/virtualization is the best for the system
  container. But there is another family of container called
  application container, it is not a system which is run inside a
  container but only the application. If you want to run a oracle
  database inside a container, you can run it inside an application
  container without launching init and all the services.
 
  This family of containers are used too for HPC (high performance
  computing) and for distributed checkpoint/restart. The cluster
  runs hundred of jobs, spawning them on different hosts inside an
  application container. Usually the jobs communicates with broadcast
  and multicast. Application containers does not care of having
  different MAC address and rely on a layer 3 approach.
 
  Are application containers comfortable with a layer 2 virtualization
  ? I don't think so, because several jobs running inside the same
  host communicate via broadcast/multicast between them and between
  other jobs running on different hosts. The IP consumption is a
  problem too: 1 container == 2 IP (one for the root namespace/
  one for the container), multiplicated with the number of jobs.
  Furthermore, lot of jobs == lot of virtual devices.
 
  However, after a discussion with Kirill at the OLS, it appears we
  can merge the layer 2 and 3 approaches if the level of network
  virtualization is tunable and we can choose layer 2 or layer 3 when
  doing the unshare. The determination of the namespace for the
  incoming traffic can be done with an specific iptable module as
  a first step. While looking at the network namespace patches, it
  appears that the TCP/UDP part is **very** similar at what is needed
  for a layer
  3 approach.
 
  Any thoughts ?
 
 For HPC if you are interested in migration you need a separate IP
 per container. If you can take you IP address with you migration of
 networking state is simple. If you can't take your IP address with you
 a network container is nearly pointless from a migration perspective.

 Beyond that from everything I have seen layer 2 is just much cleaner
 than any layer 3 approach short of Serge's bind filtering.

well, the 'ip subset' approach Linux-VServer and
other Jail solutions use is very clean, it just does
not match your expectations of a virtual interface
(as there is none) and it does not cope well with
all kinds of per context 'requirements', which IMHO
do not really exist on the application layer (only
on the whole system layer)

 Beyond that I have yet to see a clean semantics for anything
 resembling your layer 2 layer 3 hybrid approach. If we can't have
 clear semantics it is by definition impossible to implement correctly
 because no one understands what it is supposed to do.

IMHO that would be quite simple, have a 'namespace'
for limiting port binds to a subset of the available
ips and another one which does complete network 
virtualization with all the whistles and bells, IMHO
most of them are orthogonal and can easily be combined

 - full network virtualization
 - lightweight ip subset 
 - both

 Note. A true layer 3 approach has no impact on TCP/UDP filtering
 because it filters at bind time not at packet reception time. Once you
 start inspecting packets I don't see what the gain is from not going
 all of the way to layer 2.

IMHO this requirement only arises from the full system
virtualization approach, just look at the other jail
solutions (solaris, bsd, ...) some of them do not even 
allow for more than a single ip but they work quite
well when used properly ...

best,
Herbert

 Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-09-05 Thread Eric W. Biederman

 This family of containers are used too for HPC (high performance computing) 
 and
 for distributed checkpoint/restart. The cluster runs hundred of jobs, spawning
 them on different hosts inside an application container. Usually the jobs
 communicates with broadcast and multicast.
 Application containers does not care of having different MAC address and rely 
 on
 a layer 3 approach.

Ok, I think to understand this we need some precise definitions.
In the normal case it is an error for a job to communicate with a different
job.

The basic advantage with a different MAC is that you can find out who the
intended recipient is sooner in the networking stack and you have truly
separate network devices, allowing for a cleaner implementation.

Changing the MAC after migration is likely to be fine.

 Are application containers comfortable with a layer 2 virtualization ? I don't
 think so, because several jobs running inside the same host communicate via
 broadcast/multicast between them and between other jobs running on different
 hosts. The IP consumption is a problem too: 1 container == 2 IP (one for the
 root namespace/ one for the container), multiplicated with the number of
 jobs. Furthermore, lot of jobs == lot of virtual devices.

First, if you hook your network namespaces up with ethernet bridging
you don't need any extra IPs.

Second, I don't see the conflict you perceive between application containers
and layer 2 containment.

The bottom line is that you need at least one loopback interface per non-trivial
network namespace.  Once you get that, having a virtual device is no big deal.  In
addition, network devices don't consume more memory than a process.  So lots
of network devices should not be a problem.
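
As a rough sketch of what such a virtual pair / pass-through device
amounts to (the names are assumptions, not code from the posted patches),
one end's transmit routine simply feeds the packet back into the stack as
a receive on its peer, which can sit in another namespace or be enslaved
to a bridge there:

/*
 * Sketch of the "pass-through" / virtual pair idea.
 */
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>

struct pair_priv {
    struct net_device *peer;    /* the other end of the pair */
};

static int pair_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct pair_priv *priv = netdev_priv(dev);

    /* Re-parent the packet onto the peer and hand it back to the
     * stack, as if it had just arrived on that device. */
    skb->protocol = eth_type_trans(skb, priv->peer);
    netif_rx(skb);
    return 0;
}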

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-09-05 Thread Eric W. Biederman
Herbert Poetzl [EMAIL PROTECTED] writes:

 On Tue, Sep 05, 2006 at 08:45:39AM -0600, Eric W. Biederman wrote:
 Daniel Lezcano [EMAIL PROTECTED] writes:
 
 For HPC if you are interested in migration you need a separate IP
 per container. If you can take you IP address with you migration of
 networking state is simple. If you can't take your IP address with you
 a network container is nearly pointless from a migration perspective.

 Beyond that from everything I have seen layer 2 is just much cleaner
 than any layer 3 approach short of Serge's bind filtering.

 well, the 'ip subset' approach Linux-VServer and
 other Jail solutions use is very clean, it just does
 not match your expectations of a virtual interface
 (as there is none) and it does not cope well with
 all kinds of per context 'requirements', which IMHO
 do not really exist on the application layer (only
 on the whole system layer)

I probably expressed that wrong.  There are currently three
basic approaches under discussion.
Layer 3 (basically bind filtering): nothing at the packet level.
   The approach taken by Serge's version of bsdjails and Vserver.

Layer 2.5: what Daniel proposed.

Layer 2 (trivially mapping each packet to a different interface),
   then treating everything as multiple instances of the
   network stack.
   Roughly what OpenVZ and I have implemented.

You can get into some weird complications at layer 3, but because
it doesn't touch each packet the proof that it is fast is trivial.

 Beyond that I have yet to see a clean semantics for anything
 resembling your layer 2 layer 3 hybrid approach. If we can't have
 clear semantics it is by definition impossible to implement correctly
 because no one understands what it is supposed to do.

 IMHO that would be quite simple, have a 'namespace'
 for limiting port binds to a subset of the available
 ips and another one which does complete network 
 virtualization with all the whistles and bells, IMHO
 most of them are orthogonal and can easily be combined

  - full network virtualization
  - lightweight ip subset 
  - both

Quite possibly.  The LSM will stay for a while so we do have
a clean way to restrict port binds.
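
For reference, a minimal sketch of that LSM route (roughly what the
bsdjail module quoted later in this thread does, assuming it is built
with CONFIG_SECURITY_NETWORK; jail_addr_allowed() is a hypothetical
helper, not an existing API):

/*
 * Sketch: restrict IPv4 binds through the LSM socket_bind hook.
 */
#include <linux/security.h>
#include <linux/sched.h>
#include <linux/net.h>
#include <linux/in.h>
#include <linux/errno.h>

/* hypothetical: does this task's context own the address? */
extern int jail_addr_allowed(struct task_struct *task, __be32 addr);

static int example_socket_bind(struct socket *sock,
                               struct sockaddr *address, int addrlen)
{
    struct sockaddr_in *sin = (struct sockaddr_in *)address;

    if (address->sa_family != AF_INET)
        return 0;               /* only IPv4 binds are filtered here */
    if (!jail_addr_allowed(current, sin->sin_addr.s_addr))
        return -EPERM;
    return 0;
}

static struct security_operations example_ops = {
    .socket_bind = example_socket_bind,
};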

 Note. A true layer 3 approach has no impact on TCP/UDP filtering
 because it filters at bind time not at packet reception time. Once you
 start inspecting packets I don't see what the gain is from not going
 all of the way to layer 2.

 IMHO this requirement only arises from the full system
 virtualization approach, just look at the other jail
 solutions (solaris, bsd, ...) some of them do not even 
 allow for more than a single ip but they work quite
 well when used properly ...


Yes they do.  Currently I am strongly opposed to Daniel's Layer 2.5 approach
as I see no redeeming value in it.  A good clean layer 3 approach I 
avoid only because I think we can do better.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-08-17 Thread Kirill Korotaev

Basically there are currently 3 approaches that have been proposed.

The trivial bsdjail style as implemented by Serge and in a slightly
more sophisticated version in vserver.  This approach as it does not
touch the packets has little to no packet level overhead.  Basically
this is what I have called the Level 3 approach.

The more in depth approach where we modify the packet processing based
upon which network interface the packet comes in on, and it looks like
each namespace has it's own instance of the network stack. Roughly
what was proposed earlier in this thread the Level 2 approach.  This
potentially has per packet overhead so we need to watch the implementation
very carefully.

Some weird hybrid as proposed by Daniel, that I was never clear on the
semantics.

The good thing is that these approaches do not contradict each other.
We discussed it with Daniel during the summit and, as Andrey proposed,
some shortcuts can be created to avoid double stack traversal.


From the previous conversations my impression was that as long as
we could get a Layer 2 approach that did not slow down the networking
stack and was clean, everyone would be happy.

agree.


I'm buried in the process id namespace at the moment, and expect
to be so for the rest of the month, so I'm not going to be
very helpful except for a few stray comments.

I will be very much obliged if you find some time to review these new
patches so that we could make some progress here.

Thanks,
Kirill
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] network namespaces

2006-08-16 Thread Andrey Savochkin
Hi All,

I'd like to resurrect our discussion about network namespaces.
In our previous discussions it appeared that we have rather polar concepts
which seemed hard to reconcile.
Now I have an idea how to look at all discussed concepts to enable everyone's
usage scenario.

1. The most straightforward concept is complete separation of namespaces,
   covering device list, routing tables, netfilter tables, socket hashes, and
   everything else.

   On input path, each packet is tagged with namespace right from the
   place where it appears from a device, and is processed by each layer
   in the context of this namespace.
   Non-root namespaces communicate with the outside world in two ways: by
   owning hardware devices, or receiving packets forwarded to them by their parent
   namespace via a pass-through device.

   This complete separation of namespaces is very useful for at least two
   purposes:
- allowing users to create and manage on their own various tunnels and
  VPNs, and
- enabling easier and more straightforward live migration of groups of
  processes with their environment.

2. People expressed concerns that complete separation of namespaces
   may introduce an undesired overhead in certain usage scenarios.
   The overhead comes from packets traversing input path, then output path,
   then input path again in the destination namespace if root namespace
   acts as a router.

   So, we may introduce short-cuts, where an input packet starts to be processed
   in one namespace, but changes it at some upper layer.
   The places where a packet can change namespace are, for example:
   routing, the post-routing netfilter hook, or even the lookup in the socket hash.

   The cleanest example among them is the post-routing netfilter hook.
   Tagging of input packets there means that the packet is checked against
   the root namespace's routing table, found to be local, and goes directly to
   the socket hash lookup in the destination namespace.
   In this scheme the ability to change routing tables or netfilter rules on
   a per-namespace basis is traded for lower overhead.

   All other optimized schemes where input packets do not travel
   input-output-input paths in the general case may be viewed as short-cuts in
   scheme (1).  The remaining question is exactly which short-cuts make the most
   sense, and how to make them consistent from the interface point of view.

My current idea is to reach some agreement on the basic concept, review
patches, and then move on to implementing feasible short-cuts.
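
Purely as an illustration of scheme (1) (none of the names below come
from the patches that follow in this thread; they are assumptions), the
separated state and the input-path tagging could be pictured roughly
like this:

/* Sketch only: the per-namespace state of scheme (1). */
#include <linux/list.h>

struct fib_table;                       /* IPv4 routing, forward declaration */

struct net_namespace {
    struct list_head   dev_list;        /* devices owned by this namespace */
    struct fib_table   *fib_main;       /* routing tables */
    struct hlist_head  *tcp_hash;       /* socket hashes */
    struct hlist_head  *udp_hash;
    struct list_head   nf_tables;       /* netfilter state */
    /* ... everything else that is global today ... */
};

/*
 * struct net_device and struct sk_buff would each gain a
 * "struct net_namespace *ns" pointer.  Tagging on the input path is
 * then a single assignment where the packet appears from the device:
 *
 *      skb->ns = skb->dev->ns;
 *
 * and every subsequent layer resolves its state through skb->ns
 * instead of through global variables.
 */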

Opinions?

Next in this thread are patches introducing namespaces to device list,
IPv4 routing, and socket hashes, and a pass-through device.
Patches are against 2.6.18-rc4-mm1.

Best regards,

Andrey
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-08-16 Thread Serge E. Hallyn
Quoting Andrey Savochkin ([EMAIL PROTECTED]):
 Hi All,
 
 I'd like to resurrect our discussion about network namespaces.
 In our previous discussions it appeared that we have rather polar concepts
 which seemed hard to reconcile.
 Now I have an idea how to look at all discussed concepts to enable everyone's
 usage scenario.
 
 1. The most straightforward concept is complete separation of namespaces,
covering device list, routing tables, netfilter tables, socket hashes, and
everything else.
 
On input path, each packet is tagged with namespace right from the
place where it appears from a device, and is processed by each layer
in the context of this namespace.
Non-root namespaces communicate with the outside world in two ways: by
owning hardware devices, or receiving packets forwarded them by their 
 parent
namespace via pass-through device.
 
This complete separation of namespaces is very useful for at least two
purposes:
 - allowing users to create and manage by their own various tunnels and
   VPNs, and
 - enabling easier and more straightforward live migration of groups of
   processes with their environment.

I conceptually prefer this approach, but I seem to recall there were
actual problems in using this for checkpoint/restart of lightweight
(application) containers.  Performance aside, are there any reasons why
this approach would be problematic for c/r?

I'm afraid Daniel may be on vacation, and don't know who else other than
Eric might have thoughts on this.

 2. People expressed concerns that complete separation of namespaces
may introduce an undesired overhead in certain usage scenarios.
The overhead comes from packets traversing input path, then output path,
then input path again in the destination namespace if root namespace
acts as a router.
 
So, we may introduce short-cuts, when input packet starts to be processes
in one namespace, but changes it at some upper layer.
The places where packet can change namespace are, for example:
routing, post-routing netfilter hook, or even lookup in socket hash.
 
The cleanest example among them is post-routing netfilter hook.
Tagging of input packets there means that the packets is checked against
root namespace's routing table, found to be local, and go directly to
the socket hash lookup in the destination namespace.
In this scheme the ability to change routing tables or netfilter rules on
a per-namespace basis is traded for lower overhead.
 
All other optimized schemes where input packets do not travel
input-output-input paths in general case may be viewed as short-cuts in
scheme (1).  The remaining question is which exactly short-cuts make most
sense, and how to make them consistent from the interface point of view.
 
 My current idea is to reach some agreement on the basic concept, review
 patches, and then move on to implementing feasible short-cuts.
 
 Opinions?
 
 Next in this thread are patches introducing namespaces to device list,
 IPv4 routing, and socket hashes, and a pass-through device.
 Patches are against 2.6.18-rc4-mm1.

Just to provide the extreme other end of implementation options, here is
the bsdjail based version I've been using for some testing while waiting
for network namespaces to show up in -mm  :)

(Not intended for *any* sort of inclusion consideration :)

Example usage:
ifconfig eth0:0 192.168.1.16
echo -n ip 192.168.1.16 > /proc/$$/attr/exec
exec /bin/sh

-serge

From: Serge E. Hallyn [EMAIL PROTECTED](none)
Date: Wed, 26 Jul 2006 21:47:13 -0500
Subject: [PATCH 1/1] bsdjail: define bsdjail lsm

Define the actual bsdjail LSM.

Signed-off-by: Serge E. Hallyn [EMAIL PROTECTED]
---
 security/Kconfig   |   11 
 security/Makefile  |1 
 security/bsdjail.c | 1351 
 3 files changed, 1363 insertions(+), 0 deletions(-)

diff --git a/security/Kconfig b/security/Kconfig
index 67785df..fa30e40 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -105,6 +105,17 @@ config SECURITY_SECLVL
 
  If you are unsure how to answer this question, answer N.
 
+config SECURITY_BSDJAIL
+   tristate "BSD Jail LSM"
+   depends on SECURITY
+   select SECURITY_NETWORK
+   help
+ Provides BSD Jail compartmentalization functionality.
+ See Documentation/bsdjail.txt for more information and
+ usage instructions.
+
+ If you are unsure how to answer this question, answer N.
+
 source security/selinux/Kconfig
 
 endmenu
diff --git a/security/Makefile b/security/Makefile
index 8cbbf2f..050b588 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -17,3 +17,4 @@ obj-$(CONFIG_SECURITY_SELINUX)+= selin
 obj-$(CONFIG_SECURITY_CAPABILITIES)+= commoncap.o capability.o
 obj-$(CONFIG_SECURITY_ROOTPLUG)+= commoncap.o root_plug.o
obj-$(CONFIG_SECURITY_SECLVL)  += seclvl.o
+obj-$(CONFIG_SECURITY_BSDJAIL) += bsdjail.o

Re: [RFC] network namespaces

2006-08-16 Thread Alexey Kuznetsov
Hello!

 (application) containers.  Performance aside, are there any reasons why
 this approach would be problematic for c/r?

This approach is just perfect for c/r.

Probably, this is the only approach where migration can be done
in a clean and self-consistent way.

Alexey
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-08-16 Thread Eric W. Biederman
Alexey Kuznetsov [EMAIL PROTECTED] writes:

 Hello!

 (application) containers.  Performance aside, are there any reasons why
 this approach would be problematic for c/r?

 This approach is just perfect for c/r.

Yes.  For c/r you need to take your state with you.

 Probably, this is the only approach when migration can be done
 in a clean and self-consistent way.

Basically there are currently 3 approaches that have been proposed.

The trivial bsdjail style as implemented by Serge and in a slightly
more sophisticated version in vserver.  This approach, as it does not
touch the packets, has little to no packet-level overhead.  Basically
this is what I have called the Level 3 approach.

The more in-depth approach where we modify the packet processing based
upon which network interface the packet comes in on, and it looks like
each namespace has its own instance of the network stack.  Roughly
what was proposed earlier in this thread, the Level 2 approach.  This
potentially has per-packet overhead so we need to watch the implementation
very carefully.

Some weird hybrid as proposed by Daniel, that I was never clear on the
semantics.

From the previous conversations my impression was that as long as
we could get a Layer 2 approach that did not slow down the networking
stack and was clean, everyone would be happy.

I'm buried in the process id namespace at the moment, and expect
to be so for the rest of the month, so I'm not going to be
very helpful except for a few stray comments.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Network namespaces a path to mergable code.

2006-06-28 Thread Cedric Le Goater
Hello,

Eric W. Biederman wrote:
 Thinking about this I am going to suggest a slightly different direction
 for get a patchset we can merge.
 
 First we concentrate on the fundamentals.
 - How we mark a device as belonging to a specific network namespace.
 - How we mark a socket as belonging to a specific network namespace.
 
 As part of the fundamentals we add a patch to the generic socket code
 that by default will disable it for protocol families that do not indicate
 support for handling network namespaces, on a non-default network namespace.
 
 I think that gives us a path that will allow us to convert the network stack
 one protocol family at a time instead of in one big lump.
 
 Stubbing off the sysfs and sysctl interfaces in the first round for the
 non-default namespaces as you have done should be good enough.
 
 The reason for the suggestion is that most of the work for the protocol
 stacks ipv4 ipv6 af_packet af_unix is largely noise, and simple
 replacement without real design work happening.  Mostly it is just
 tweaking the code to remove global variables, and doing a couple
 lookups.

How does that proposal differ from Daniel's initial patchset? How far was
that patchset from reaching a similar agreement?

OK, I wear blue socks :), but I'm not advocating one patchset over
another; I'm just looking for a shorter path.

thanks,

C.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Network namespaces a path to mergable code.

2006-06-28 Thread Eric W. Biederman
Cedric Le Goater [EMAIL PROTECTED] writes:

 How that proposal differs from the initial Daniel's patchset ? how far was
 that patchset to reach a similar agreement ?

My impression is as follows.  The OpenVz implementation and mine work
on the same basic principles of handling the network stack at layer 2.

We have our implementation differences but the core ideas are about the
same.

Daniel's patch still had elements of layer 3 handling as I recall,
and that has problems.

 OK, i wear blue socks :), but I'm not advocating a patchset more than
 another i'm just looking for a shorter path.

Besides laying the foundations, the current conversation seems to be
about understanding the implications for the network stack when
we implement a network namespace.

There is a lot to the networking stack so it takes a while.
In addition this is one part of the problem that everyone has implemented,
so we have several more opinions on how it should be done and what
needs to happen.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] Network namespaces a path to mergable code.

2006-06-27 Thread Eric W. Biederman

Thinking about this I am going to suggest a slightly different direction
for getting a patchset we can merge.

First we concentrate on the fundamentals.
- How we mark a device as belonging to a specific network namespace.
- How we mark a socket as belonging to a specific network namespace.

As part of the fundamentals we add a patch to the generic socket code
that by default will disable it for protocol families that do not indicate
support for handling network namespaces, on a non-default network namespace.

I think that gives us a path that will allow us to convert the network stack
one protocol family at a time instead of in one big lump.
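
A hedged sketch of that generic-socket-code gate (the flag name,
structure and helper below are assumptions for illustration, not the
actual patch):

/*
 * Sketch only: in a non-default namespace, refuse socket creation for
 * any protocol family that has not declared namespace support.  The
 * default namespace behaves exactly as today.
 */
#include <linux/errno.h>

struct socket;                          /* declared in linux/net.h */

struct ns_proto_family {
    int family;
    int netns_ok;                       /* set once the family is converted */
    int (*create)(struct socket *sock, int protocol);
};

static int ns_family_allowed(const struct ns_proto_family *pf,
                             int in_default_ns)
{
    if (in_default_ns || pf->netns_ok)
        return 0;
    return -EAFNOSUPPORT;               /* family not yet namespace-aware */
}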

Stubbing off the sysfs and sysctl interfaces in the first round for the
non-default namespaces as you have done should be good enough.

The reason for the suggestion is that most of the work for the protocol
stacks ipv4 ipv6 af_packet af_unix is largely noise, and simple
replacement without real design work happening.  Mostly it is just
tweaking the code to remove global variables, and doing a couple
lookups.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html