Re: [Devel] Re: Network virtualization/isolation

2006-12-09 Thread Eric W. Biederman
Herbert Poetzl [EMAIL PROTECTED] writes:

 On Fri, Dec 08, 2006 at 12:57:49PM -0700, Eric W. Biederman wrote:
 Herbert Poetzl [EMAIL PROTECTED] writes:
 
  But, ok, it is not the real point to argue so much imho 
  and waste our time instead of doing things.

  well, IMHO better talk (and think) first, then implement
  something ... not the other way round, and then start
  fixing up the mess ...
 
 Well we need a bit of both.

 hmm, are 'we' in a hurry here?

We need to talk about code, and particular patches, not just talk.

There are two sides to what we are building.
- The user interface, and semantics.
- The kernel implementation.

For the user interface, getting it as close to perfect as we can
the first time is extremely important, because we won't be able
to change it later.

For the kernel implementation we don't have to be perfect; we just
need something that is good enough.  We can change the
implementation every release if we find better ways of implementing
our user space semantics. 

 until recently, 'Linux' (mainline) didn't even want
 to hear about OS Level virtualization, now there
 is a rush to quickly get 'something' in, not knowing
 or caring if it is usable at all?

 I think there are a lot of 'potential users' for
 this kind of virtualization, and so 'we' can test
 almost all aspects outside of mainline, and once
 we know the stuff works as expected, then we can
 integrate it ...

We should do this as part of the linux kernel community.  There
is no 'outside of mainline' development.  We need to get feedback from
other developers whose code we may affect.  This is particularly true
of the kernel networking stack.

 that's something I do not really agree with, stuff
 integrated into the kernel should be well designed
 and it should be tested ...

Yes, but you can break a problem into reasonable chunks and
solve each of those pieces individually. 

Eric






Re: [Devel] Re: Network virtualization/isolation

2006-12-09 Thread Tomasz Torcz
On Sat, Dec 09, 2006 at 04:50:02AM +0100, Herbert Poetzl wrote:
 On Fri, Dec 08, 2006 at 12:57:49PM -0700, Eric W. Biederman wrote:
  Herbert Poetzl [EMAIL PROTECTED] writes:
  
   But, ok, it is not the real point to argue so much imho 
   and waste our time instead of doing things.
 
   well, IMHO better talk (and think) first, then implement
   something ... not the other way round, and then start
   fixing up the mess ...
  
  Well we need a bit of both.
 
 hmm, are 'we' in a hurry here?
 
 until recently, 'Linux' (mainline) didn't even want
 to hear about OS Level virtualization, now there
 is a rush to quickly get 'something' in, not knowing
 or caring if it is usable at all?

  Maybe because other Operating Systems have it? For example Solaris'
Crossbow...


-- 
Tomasz Torcz   RIP is irrevelant. Spoofing is futile.
[EMAIL PROTECTED] Your routes will be aggreggated. -- Alex Yuriev





Re: [Devel] Re: Network virtualization/isolation

2006-12-09 Thread Herbert Poetzl
On Sat, Dec 09, 2006 at 12:27:34PM +0100, Tomasz Torcz wrote:
 On Sat, Dec 09, 2006 at 04:50:02AM +0100, Herbert Poetzl wrote:
  On Fri, Dec 08, 2006 at 12:57:49PM -0700, Eric W. Biederman wrote:
   Herbert Poetzl [EMAIL PROTECTED] writes:
   
But, ok, it is not the real point to argue so much imho 
and waste our time instead of doing things.
  
well, IMHO better talk (and think) first, then implement
something ... not the other way round, and then start
fixing up the mess ...
   
   Well we need a bit of both.
  
  hmm, are 'we' in a hurry here?
  
  until recently, 'Linux' (mainline) didn't even want
  to hear about OS Level virtualization, now there
  is a rush to quickly get 'something' in, not knowing
  or caring if it is usable at all?
 
 Maybe beacuse other Operating Systems have it? 

well, that wasn't a good enough reason four years 
ago, when Linux-VServer tried to push a 'jail'
implementation into mainline (was called security
contexts back then, and maintained by Jacques Gelinas)

 For example Solaris' Crossbow...

yes, but the technology isn't really new, not even
on Linux and not even in the Open Source community

but don't get me wrong here, I'm absolutely for
having virtualization (or virtualization elements)
in mainline, I just don't want to see a quick-and-dirty
hack 'we' have to suffer from for the next two years :)

HTC,
Herbert

 -- 
 Tomasz Torcz   RIP is irrevelant. Spoofing is futile.
 [EMAIL PROTECTED] Your routes will be aggreggated. -- Alex Yuriev
 




Re: [Devel] Re: Network virtualization/isolation

2006-12-09 Thread Dmitry Mishin
On Saturday 09 December 2006 09:35, Herbert Poetzl wrote:
 On Fri, Dec 08, 2006 at 10:13:48PM -0800, Andrew Morton wrote:
  On Sat, 9 Dec 2006 04:50:02 +0100
  Herbert Poetzl [EMAIL PROTECTED] wrote:
  
   On Fri, Dec 08, 2006 at 12:57:49PM -0700, Eric W. Biederman wrote:
Herbert Poetzl [EMAIL PROTECTED] writes:

 But, ok, it is not the real point to argue so much imho 
 and waste our time instead of doing things.
   
 well, IMHO better talk (and think) first, then implement
 something ... not the other way round, and then start
 fixing up the mess ...

Well we need a bit of both.
   
   hmm, are 'we' in a hurry here?
   
   until recently, 'Linux' (mainline) didn't even want
   to hear about OS Level virtualization, now there
   is a rush to quickly get 'something' in, not knowing
   or caring if it is usable at all?
  
  It's actually happening quite gradually and carefully.
 
 hmm, I must have missed a testing phase for the
 IPC namespace then, not that I think it is broken
 (well, maybe it is, we do not know yet)
Herbert,

you know that this code is used in our product, which in turn is
tested internally and by a community. We have no reports of bugs
in this code. If you have more to say than just 'something', please say it.

 
   I think there are a lot of 'potential users' for
   this kind of virtualization, and so 'we' can test
   almost all aspects outside of mainline, and once
   we know the stuff works as expected, then we can
   integrate it ...
   
   the UTS namespace was something 'we all' had already
   implemented in this (or a very similar) way, and in
   one or two interations, it should actually work as 
   expected. nevertheless, it was one of the simplest
   spaces ...
   
   we do not yet know the details for the IPC namespace,
   as IPC is not that easy to check as UTS, and 'we'
   haven't gotten real world feedback on that yet ...
  
  We are very dependent upon all stakeholders including yourself 
  to review, test and comment upon this infrastructure as it is 
  proposed and merged. If something is proposed which will not 
  suit your requirements then it is important that we hear about 
  it, in detail, at the earliest possible time.
 
 okay, good to hear that I'm still considered a stakeholder 
 
 will try to focus the feedback and cc as many folks
 as possible, as it seems that some feedback is lost
 on the way upstream ...
 
 best,
 Herbert
 
  Thanks.
 

-- 
Thanks,
Dmitry.


Re: [Devel] Re: Network virtualization/isolation

2006-12-09 Thread Kir Kolyshkin

Herbert Poetzl wrote:

On Fri, Dec 08, 2006 at 10:13:48PM -0800, Andrew Morton wrote:

It's actually happening quite gradually and carefully.

hmm, I must have missed a testing phase for the
IPC namespace then, not that I think it is broken
(well, maybe it is, we do not know yet)

You have announced on LKML that Linux-VServer now uses the stuff that 
was merged in 2.6.19-rc1, haven't you? I suppose that means you are 
using IPC namespaces from mainstream? Isn't that considered testing? Or 
don't you test Linux-VServer? Please clarify, I'm a bit lost here.


Speaking of OpenVZ, as Kirill Korotaev said before, we backported 
all of that to 2.6.18 back in September and have been using it since 
then. And yes, we found a bug in IPC namespaces, and the fix from 
Pavel Emelyanov made it into 2.6.19-rc5 (see commit 
c7e12b838989b0e432c7a1cdf1e6c6fd936007f6 in linux-2.6-git).




Re: [Devel] Re: Network virtualization/isolation

2006-12-09 Thread Herbert Poetzl
On Sun, Dec 10, 2006 at 01:34:14AM +0300, Kir Kolyshkin wrote:
 Herbert Poetzl wrote:
 On Fri, Dec 08, 2006 at 10:13:48PM -0800, Andrew Morton wrote:
   
 
 It's actually happening quite gradually and carefully.
 
 
 hmm, I must have missed a testing phase for the
 IPC namespace then, not that I think it is broken
 (well, maybe it is, we do not know yet)
 

 You have announced at LKML that Linux-VServer now uses the 
 stuff that was merged in 2.6.19-rc1, haven't you? 

yes, correct, and we already fixed several issues
the changes caused, both in handling as well as
functionality

 I suppose that means you are  using IPC namespaces from 
 mainstream? 

yes, we do 

 Isn't that considered testing? 

of course it is testing, but it is already in
mainstream, and for my part, I wasn't able to
provide feedback from testing yet ...

 Or you don't test Linux-VServer?

we do the same testing you folks do IIRC
(i.e. some secret test procedure which takes
roughly a week or so, after which we can tell
that everything works as expected :)

 Please clarify, I'm a bit lost here.


 Speaking of OpenVZ, as Kirill Korotaev said before we have 
 backported all that to 2.6.18 back in September 

nice, but what relevance does that have for 2.6.19?

 and are using it since then.

cool, how much feedback regarding IPC did you get
since then?

 And yes, we found a bug in IPC namespaces, and fix from 
 Pavel Emelyanov has made it to 2.6.19-rc5 (see commit 
 c7e12b838989b0e432c7a1cdf1e6c6fd936007f6 to linux-2.6-git).

it's good that some bugs have been found, but
of what relevance is that for testing mainline
patches?

 - typical linux users will only exercise a 
   small fraction of the new code, if at all
 - virtualization solutions like OpenVZ and
   Linux-VServer add their custom modifications
   and/or adjustments, and serve a much smaller
   userbase
 - I haven't seen any test suites or similar
   for the spaces

so it all boils down to waiting for somebody to
stumble over an issue, which then will get fixed;
it's just that the number of folks testing this is
quite small compared to 'other' mainline patches

anyway, originally I was just answering to an
email pushing for 'fast' inclusion, which I do
not consider a good idea (as I already stated)

best,
Herbert



Re: [Devel] Re: Network virtualization/isolation

2006-12-08 Thread Eric W. Biederman
Herbert Poetzl [EMAIL PROTECTED] writes:

 But, ok, it is not the real point to argue so much imho and waste our
 time instead of doing things.

 well, IMHO better talk (and think) first, then implement
 something ... not the other way round, and then start
 fixing up the mess ...

Well we need a bit of both.

This is thankfully not exported to user space, so as long as our
implementation is correct it doesn't much matter.

I do agree with the point that context may make sense.  I have
yet to be convinced though.

Eric


Re: [Devel] Re: Network virtualization/isolation

2006-12-08 Thread Herbert Poetzl
On Fri, Dec 08, 2006 at 12:57:49PM -0700, Eric W. Biederman wrote:
 Herbert Poetzl [EMAIL PROTECTED] writes:
 
  But, ok, it is not the real point to argue so much imho 
  and waste our time instead of doing things.

  well, IMHO better talk (and think) first, then implement
  something ... not the other way round, and then start
  fixing up the mess ...
 
 Well we need a bit of both.

hmm, are 'we' in a hurry here?

until recently, 'Linux' (mainline) didn't even want
to hear about OS Level virtualization, now there
is a rush to quickly get 'something' in, not knowing
or caring if it is usable at all?

I think there are a lot of 'potential users' for
this kind of virtualization, and so 'we' can test
almost all aspects outside of mainline, and once
we know the stuff works as expected, then we can
integrate it ...

the UTS namespace was something 'we all' had already
implemented in this (or a very similar) way, and in
one or two iterations, it should actually work as 
expected. nevertheless, it was one of the simplest
spaces ...

we do not yet know the details for the IPC namespace,
as IPC is not as easy to check as UTS, and 'we'
haven't gotten real world feedback on that yet ...

so personally I think we should start some serious
testing on the upcoming namespaces, and we should
continue discussing the various approaches, until
'we' can agree on the (almost) 'perfect' solution

 This is thankfully not exported to user space, so as long 
 as our implementation is correct it doesn't much matter.

that's something I do not really agree with, stuff
integrated into the kernel should be well designed
and it should be tested ...

best,
Herbert

 I do agree with the point that context may make sense. 
 I have yet to be convinced though.
 
 Eric


Re: [Devel] Re: Network virtualization/isolation

2006-12-08 Thread Andrew Morton
On Sat, 9 Dec 2006 04:50:02 +0100
Herbert Poetzl [EMAIL PROTECTED] wrote:

 On Fri, Dec 08, 2006 at 12:57:49PM -0700, Eric W. Biederman wrote:
  Herbert Poetzl [EMAIL PROTECTED] writes:
  
   But, ok, it is not the real point to argue so much imho 
   and waste our time instead of doing things.
 
   well, IMHO better talk (and think) first, then implement
   something ... not the other way round, and then start
   fixing up the mess ...
  
  Well we need a bit of both.
 
 hmm, are 'we' in a hurry here?
 
 until recently, 'Linux' (mainline) didn't even want
 to hear about OS Level virtualization, now there
 is a rush to quickly get 'something' in, not knowing
 or caring if it is usable at all?

It's actually happening quite gradually and carefully.

 I think there are a lot of 'potential users' for
 this kind of virtualization, and so 'we' can test
 almost all aspects outside of mainline, and once
 we know the stuff works as expected, then we can
 integrate it ...
 
 the UTS namespace was something 'we all' had already
 implemented in this (or a very similar) way, and in
 one or two interations, it should actually work as 
 expected. nevertheless, it was one of the simplest
 spaces ...
 
 we do not yet know the details for the IPC namespace,
 as IPC is not that easy to check as UTS, and 'we'
 haven't gotten real world feedback on that yet ...

We are very dependent upon all stakeholders including yourself to review,
test and comment upon this infrastructure as it is proposed and merged.
If something is proposed which will not suit your requirements then it
is important that we hear about it, in detail, at the earliest possible time.

Thanks.


Re: [Devel] Re: Network virtualization/isolation

2006-12-08 Thread Herbert Poetzl
On Fri, Dec 08, 2006 at 10:13:48PM -0800, Andrew Morton wrote:
 On Sat, 9 Dec 2006 04:50:02 +0100
 Herbert Poetzl [EMAIL PROTECTED] wrote:
 
  On Fri, Dec 08, 2006 at 12:57:49PM -0700, Eric W. Biederman wrote:
   Herbert Poetzl [EMAIL PROTECTED] writes:
   
But, ok, it is not the real point to argue so much imho 
and waste our time instead of doing things.
  
well, IMHO better talk (and think) first, then implement
something ... not the other way round, and then start
fixing up the mess ...
   
   Well we need a bit of both.
  
  hmm, are 'we' in a hurry here?
  
  until recently, 'Linux' (mainline) didn't even want
  to hear about OS Level virtualization, now there
  is a rush to quickly get 'something' in, not knowing
  or caring if it is usable at all?
 
 It's actually happening quite gradually and carefully.

hmm, I must have missed a testing phase for the
IPC namespace then, not that I think it is broken
(well, maybe it is, we do not know yet)

  I think there are a lot of 'potential users' for
  this kind of virtualization, and so 'we' can test
  almost all aspects outside of mainline, and once
  we know the stuff works as expected, then we can
  integrate it ...
  
  the UTS namespace was something 'we all' had already
  implemented in this (or a very similar) way, and in
  one or two interations, it should actually work as 
  expected. nevertheless, it was one of the simplest
  spaces ...
  
  we do not yet know the details for the IPC namespace,
  as IPC is not that easy to check as UTS, and 'we'
  haven't gotten real world feedback on that yet ...
 
 We are very dependent upon all stakeholders including yourself 
 to review, test and comment upon this infrastructure as it is 
 proposed and merged. If something is proposed which will not 
 suit your requirements then it is important that we hear about 
 it, in detail, at the earliest possible time.

okay, good to hear that I'm still considered a stakeholder 

will try to focus the feedback and cc as many folks
as possible, as it seems that some feedback is lost
on the way upstream ...

best,
Herbert

 Thanks.


Re: [Devel] Re: Network virtualization/isolation

2006-12-06 Thread Kirill Korotaev
If there is a better and less intrusive while still being obvious
method I am all for it.  I do not like the OpenVZ thing of doing the
lookup once and then stashing the value in current and the special
casing the exceptions.

Why?
 
 
 I like it when things are obvious and not implied.
 
 The implementations seems to favor fewer lines of code touched over
 maintainability of the code.  Which if you are maintaining out of
 tree code is fine.  At leas that was my impression last time
 I looked at the code.
FYI, when we started doing networking virtualization many years ago
we tried both approaches.
Over time, context notion looked much more natural and easier for us.
Even Alexey Kuznetsov tells that he prefers exec_env as the logic
becomes very clear and little mess is introduced.

 I know there are a lot of silly things in the existing implementations
 because they were initially written without the expectation of being
 able to merge the code into the main kernel.  This resulted in some
 non-general interfaces, and a preference for patches that touch
 as few lines of code as possible.  
Sure, but the OpenVZ code is constantly being cleaned of such code
and we are open to discussion. No one pretends that the code is perfect
from the beginning.

 Anyway this has bit has been discussed before and we can discuss it
 seriously in the context of patch review.
Let me explain why an explicit context like exec_env is IMHO cleaner:
- context is a natural notion in the linux kernel, e.g. current.
  why not pass 'current' to all the functions as an argument,
  starting from entry.S?
  in_atomic(), in_interrupt() etc. all deal with the current context.
  IMHO when one needs to pass an argument as many times as 'current',
  it is better to use the notion of a context.
- e.g. NFS should set the networking context of the mount point or socket.
But, ok, it is not the real point to argue so much imho and waste our time
instead of doing things.
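
To make the contrast concrete, here is a minimal userspace sketch of the
two styles; exec_env, net_namespace and the route_lookup_* helpers are
illustrative names for this sketch, not the actual OpenVZ or mainline
symbols:

#include <stdio.h>

struct net_namespace { int id; /* routing tables, devices, ... */ };

static struct net_namespace host_ns = { 0 };

/* Style 1: implicit context, analogous to 'current'.  The pointer is
 * set once per task (or per softirq) and every function just uses it. */
static struct net_namespace *exec_env = &host_ns;

static int route_lookup_implicit(unsigned int daddr)
{
	struct net_namespace *ns = exec_env;	/* no extra argument needed */

	printf("lookup %u in namespace %d\n", daddr, ns->id);
	return 0;
}

/* Style 2: explicit context, threaded through every call as an argument. */
static int route_lookup_explicit(struct net_namespace *ns, unsigned int daddr)
{
	printf("lookup %u in namespace %d\n", daddr, ns->id);
	return 0;
}

int main(void)
{
	route_lookup_implicit(42);
	route_lookup_explicit(&host_ns, 42);
	return 0;
}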

Thanks,
Kirill



Re: [Devel] Re: Network virtualization/isolation

2006-12-06 Thread Herbert Poetzl
On Wed, Dec 06, 2006 at 02:54:16PM +0300, Kirill Korotaev wrote:
 If there is a better and less intrusive while still being obvious
 method I am all for it.  I do not like the OpenVZ thing of doing the
 lookup once and then stashing the value in current and the special
 casing the exceptions.
 
 Why?
  
  
  I like it when things are obvious and not implied.
  
  The implementations seems to favor fewer lines of code touched over
  maintainability of the code.  Which if you are maintaining out of
  tree code is fine.  At leas that was my impression last time
  I looked at the code.

 FYI, when we started doing networking virtualization many years ago
 we tried both approaches.
 Over time, context notion looked much more natural and easier for us.
 Even Alexey Kuznetsov tells that he prefers exec_env as the logic
 becomes very clear and little mess is introduced.
 
  I know there are a lot of silly things in the existing implementations
  because they were initially written without the expectation of being
  able to merge the code into the main kernel.  This resulted in some
  non-general interfaces, and a preference for patches that touch
  as few lines of code as possible.  
 Sure, but OpenVZ code is being constantly cleaned from such code
 and we are open for discussion. No one pretends that code is perferct
 from the beginning.
 
  Anyway this has bit has been discussed before and we can discuss it
  seriously in the context of patch review.
 Let me explain when explicit context like exec_env IMHO is cleaner:
 - context is a natural notion of linux kernel. e.g. current.
   why not pass 'current' to all the functions as an argument
   starting from entry.S?
   in_atomic(), in_interrupt() etc. all these functions deal with
   current context. IMHO when one needs to pass an argument too many
   times like 'current'
   it is better to use a notion of the context.
 - e.g. NFS should set networking context of the mount point or socket.

how would that work for a 'shared' NFS partition?
(shared between different contexts)

 But, ok, it is not the real point to argue so much imho and waste our
 time instead of doing things.

well, IMHO better talk (and think) first, then implement
something ... not the other way round, and then start
fixing up the mess ...

best,
Herbert

 Thanks,
 Kirill
 


Re: Network virtualization/isolation

2006-12-04 Thread Daniel Lezcano

Hi Jamal,

thanks for taking the time read the document.

The objective of the document was not to argue that one approach is better 
than the other. I wanted to show the pros and cons of each approach and 
to point out that the two approaches are complementary.


Currently, some resources have been moved to namespace-relative 
access, the IPC and the utsname, and this is in the 2.6.19 kernel.

The work on the pid namespace is still in progress.

The idea is to use a clone approach relying on the unshare_ns 
syscall. The syscall is called with a set of flags for pids, ipcs, 
utsname, network ... You can then unshare only the network and have an 
application in its own network environment.
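
A minimal userspace sketch of that usage, assuming a CLONE_NEWNET unshare
flag in the spirit of the CLONE_NEWUTS/CLONE_NEWIPC flags already in
2.6.19; the flag name and value are part of the proposal, not something
already in mainline:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef CLONE_NEWNET
#define CLONE_NEWNET 0x40000000	/* assumed value, for illustration only */
#endif

int main(void)
{
	/* Detach only the network environment; pids, ipc, utsname and
	 * everything else stay shared with the parent. */
	if (unshare(CLONE_NEWNET) == -1) {
		perror("unshare(CLONE_NEWNET)");
		exit(EXIT_FAILURE);
	}

	/* From here on, the sockets, devices and routes this process sees
	 * belong to its own, initially empty, network namespace. */
	printf("running in a private network namespace\n");
	return 0;
}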


For an L3 approach, like an L2 one, you can run an apache server in an 
unshared network environment. Better, you can run several apache servers 
in several network namespaces without modifying the servers' network 
configuration.


Some of us consider L2 as perfectly adapted for some kinds of containers, 
like system containers and the kind of application containers running 
big servers, but find L2 too much (a hammer to crush a beetle) for simple 
network requirements like network migration, jails or containers which do 
not care about such virtualization. For example, you may want to create 
thousands of containers for a cluster of HPC jobs and just have migration 
for these jobs. Does it make sense to have an L2 approach there?


Dmitry Mishin and I thought about an L2/L3 solution and we think we 
found a way to have the two at runtime. Roughly, it is an L3 approach based 
on bind filtering and socket isolation, very similar to what vserver 
provides. I did a prototype, and it works well for IPv4/unicast.


So, considering we have an L2 isolation/virtualization, with an L3 approach 
relying on a subset of the L2 network isolation resources, is that an 
acceptable solution ?


  -- Daniel




Re: Network virtualization/isolation

2006-12-04 Thread Eric W. Biederman
jamal [EMAIL PROTECTED] writes:

 I have removed the Re: just to add some freshness to the discussion

 So i read quickly the rest of the discussions. I was almost suprised to
 find that i agree with Eric on a lot of opinions (we also agree that
 vindaloo is good for you i guess);-
 The two issues that stood out for me (in addition to what i already said
 below):

 1) the solution must ease the migration of containers; i didnt see
 anything about migrating them to another host across a network, but i
 assume that this is a given.

It is mostly a given.  It is a goal for some of us and not for others.
Containers are a necessary first step to getting migration and
checkpoint/restart assistance from the kernel.

 2) the socket level bind/accept filtering with multiple IPs. From
 reading what Herbert has, it seems they have figured a clever way to
 optimize this path albeit some challenges (speacial casing for raw
 filters) etc.

 I am wondering if one was to use the two level muxing of the socket
 layer, how much more performance improvement the above scheme provides
 for #2? 

I don't follow this question. 

 Consider the case of L2 where by the time the packet hits the socket
 layer on incoming, the VE is already known; in such a case, the lookup
 would be very cheap. The advantage being you get rid of the speacial
 casing altogether. I dont see any issues with binds per multiple IPs etc
 using such a technique.

 For the case of #1 above, wouldnt it be also easier if the tables for
 netdevices, PIDs etc were per VE (using the 2 level mux)?

Generally yes.  s/VE/namespace/.  There is a case with hash tables where
it seems saner to add an additional entry, because it is hard to dynamically
allocate a hash table (they need something larger than a
single page allocation).  But for everything else yes it makes things
much easier if you have a per namespace data structure.  

A practical question is whether we can replace hash tables with some variant
of trie or radix-tree and not take a performance hit.  Given the better
scaling of trees to different workload sizes, if we can use them, so much the
better.  Especially because a per namespace split gives us a lot of
good properties.

 In any case, folks, i hope i am not treading on anyones toes; i know
 each one of you has implemented and has users and i am trying to be as 
 neutral as i can (but clearly biased;-).

Well we rather expect to bash heads until we can come up with something
we all can agree on, with the people who more regularly have to maintain
the code.  The discussions so far have largely been warm-ups to actually
doing something.

Getting feedback from people who regularly work with the networking stack
is appreciated.

Eric


Re: Network virtualization/isolation

2006-12-04 Thread jamal
Daniel,

On Mon, 2006-04-12 at 11:18 +0100, Daniel Lezcano wrote:
 Hi Jamal,


 Currently, there are some resources moved to a namespace relative 
 access, the IPC and the utsname and this is into the 2.6.19 kernel.
 The work on the pid namespace is still in progress.
 
 The idea is to use a clone approach relying on the unshare_ns 
 syscall. The syscall is called with a set of flags for pids, ipcs, 
 utsname, network ... You can then unshare only the network and have an 
 application into its own network environment.
 

Ok, so i take it this call is used by the setup manager on the host
side?

 For a l3 approach, like a l2, you can run an apache server into a 
 unshared network environment. Better, you can run several apaches server 
 into several network namespaces without modifying the server's network 
 configuration.
 

ok - as i understand it now, this will be the case for all the
approaches taken?

 Some of us, consider l2 as perfectly adapted for some kind of containers 
   like system containers and some kind of application containers running 
 big servers, but find the l2 too much (seems to be a hammer to crush a 
 beetle) for simple network requirements like for network migration, 
 jails or containers which does not take care of such virtualization. For 
 example, you want to create thousands of containers for a cluster of HPC 
 jobs and just to have migration for these jobs. Does it make sense to 
 have l2 approach ?
 

Perhaps not for the specific app you mentioned above.
But it makes sense for what i described as virtual routers/bridges.
I would say that the solution has to cater for a variety of
applications, no?

 Dmitry Mishin and I, we thought about a l2/l3 solution and we thing we 
 found a solution to have the 2 at runtime. Roughly, it is a l3 based on 
 bind filtering and socket isolation, very similar to what vserver 
 provides. I did a prototype, and it works well for IPV4/unicast.
 

ok - so you guys seem to be reaching at least some consensus then.

 So, considering, we have a l2 isolation/virtualization, and having a l3 
 relying on the l2 network isolation resources subset. Is it an 
 acceptable solution ?

As long as you can be generic enough so that a wide array of apps can be
met, it should be fine. For a test app, consider the virtual
bridges/routers i mentioned. 
The other requirement i would see is that apps that would run on a host
would run unchanged. The migration of containers you folks seem to
have under control - my only input into that thought, since it is early
enough, is that you may want to build your structuring in such a way that
this is easy to do.

cheers,
jamal




Re: Network virtualization/isolation

2006-12-04 Thread jamal
On Mon, 2006-04-12 at 05:15 -0700, Eric W. Biederman wrote:
 jamal [EMAIL PROTECTED] writes:
 

 Containers are a necessary first step to getting migration and 
 checkpoint/restart
 assistance from the kernel.

Isnt it like a MUST have if you are doing things from scratch, instead of
it being an afterthought?

 
  2) the socket level bind/accept filtering with multiple IPs. From
  reading what Herbert has, it seems they have figured a clever way to
  optimize this path albeit some challenges (speacial casing for raw
  filters) etc.
 
  I am wondering if one was to use the two level muxing of the socket
  layer, how much more performance improvement the above scheme provides
  for #2? 
 
 I don't follow this question. 

if you had the socket tables in a two level mux - first level to
hash on the namespace, which leads to an indirection pointer to the table
to find the socket and its bindings (with zero code changes to the
socket code) - then isnt this fast enough? Clearly you can optimize as
in the case of bind/accept filtering, but then you may have to do that
for every socket family/protocol (eg netlink doesnt have IP addresses,
but binding to multiple groups is possible)

Am i making any more sense? ;-

  Consider the case of L2 where by the time the packet hits the socket
  layer on incoming, the VE is already known; in such a case, the lookup
  would be very cheap. The advantage being you get rid of the speacial
  casing altogether. I dont see any issues with binds per multiple IPs etc
  using such a technique.
 
  For the case of #1 above, wouldnt it be also easier if the tables for
  netdevices, PIDs etc were per VE (using the 2 level mux)?
 
 Generally yes.  s/VE/namespace/.  There is a case with hash tables where
 it seems saner to add an additional entry because hash it is hard to 
 dynamically
 allocate a hash table, (because they need something large then a
 single page allocation).  

A page to store the namespace indirection hash doesnt seem to be such a
big waste; i wonder though why you even need a page. If i had 256 hash
buckets with 1024 namespaces, it is still not too much of an overhead.

 But for everything else yes it makes things
 much easier if you have a per namespace data structure.  

Ok, I am sure youve done the research; i am just being a devils
advocate.

 A practical question is can we replace hash tables with some variant of
 trie or radix-tree and not take a performance hit.  Given the better scaling 
 of
 tress to different workload sizes if we can use them so much the
 better.  Especially because a per namespace split gives us a lot of
 good properties.

Is there a patch somewhere i can stare at that you guys agree on?

 Well we rather expect to bash heads until we can come up with something
 we all can agree on with the people who more regularly have to maintain
 the code.  The discussions so far have largely been warm ups, to actually
 doing something.
 
 Getting feedback from people who regularly work with the networking stack
 is appreciated.

I hope i am being helpful;
It seems to me that folks doing the different implementations may have
had different apps in mind. IMO, as long as the solution caters for all
apps (can you do virtual bridges/routers?), then we should be fine.
Intrusiveness may not be so bad if it needs to be done once. I have to
say i like the approach where the core code and algorithms are
untouched. Thats why i am harping on the two level mux approach, where
one level is to mux and find the namespace indirection and the second
step is to use the current datastructures and algorithms as is. I dont
know how much cleaner or less intrusive you can be compared to
that. If i compile out the first level mux, I have my old net stack as
is, untouched.

cheers,
jamal



Re: Network virtualization/isolation

2006-12-04 Thread Dmitry Mishin
On Sunday 03 December 2006 19:00, Eric W. Biederman wrote:
 Ok.  Just a quick summary of where I see the discussion.

 We all agree that L2 isolation is needed at some point.
As we all agreed on this, maybe it is time to send patches one by one?
To begin with, I propose to resend Cedric's empty namespace patch as a base 
for the others - it is really empty, but necessary in order to move further.

After this patch and the following net namespace unshare patch are 
accepted, I could send network device virtualization patches for review and 
discussion.

What do you think?


 The approaches discussed for L2 and L3 are sufficiently orthogonal
 that we can implement then in either order.  You would need to
 unshare L3 to unshare L2, but if we think of them as two separate
 namespaces we are likely to be in better shape.

 The L3 discussion still has the problem that there has not been
 agreement on all of the semantics yet.

 More comments after I get some sleep.

 Eric

-- 
Thanks,
Dmitry.


Re: Network virtualization/isolation

2006-12-04 Thread Eric W. Biederman
jamal [EMAIL PROTECTED] writes:

 On Mon, 2006-04-12 at 05:15 -0700, Eric W. Biederman wrote:
 jamal [EMAIL PROTECTED] writes:
 

 Containers are a necessary first step to getting migration and
 checkpoint/restart
 assistance from the kernel.

 Isnt it like a MUST have if you are doing things from scratch instead of
 it being an after thought.

Having the proper semantics is a MUST, which generally makes those
a requirement to get consensus and to build the general mergeable
solution.

The logic for serializing the state is totally uninteresting
for the first pass at containers.  The applications inside the
containers simply don't care.

There are two basic techniques for containers.
1) Name filtering.  
   Where you keep the same global identifiers as you do now, but
   applications inside the container are only allowed to deal with a
   subset of those names.  The current vserver layer 3 networking
   approach is a handy example of this.  But this can apply to process
   ids and just about everything else. 

2) Independent namespaces. (Name duplication)
   Where you allow the same global name to refer to two different
   objects at the same time, with the context the reference comes from
   being used to resolve which global object you are talking about.

Independent namespaces are the only core requirement for migration,
because they ensure that when you get to the next machine you don't have
a conflict with your global names.

So at this point simply allowing duplicate names is the only
requirement for migration.  But yes that part is a MUST.
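
A rough sketch of the two techniques, with invented types (context,
net_namespace and the helpers below are illustrative only): name filtering
keeps one global table and merely restricts which names a container may
use, while independent namespaces give each container its own tables so
the same name can exist twice.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* 1) Name filtering: one global address space, with a per-context
 *    subset check (roughly the vserver layer 3 style named above). */
struct context {
	const uint32_t *allowed_addrs;	/* IPs this container may bind to */
	size_t n_allowed;
};

static bool bind_allowed(const struct context *ctx, uint32_t addr)
{
	for (size_t i = 0; i < ctx->n_allowed; i++)
		if (ctx->allowed_addrs[i] == addr)
			return true;
	return false;			/* the name exists, it is just not visible here */
}

/* 2) Independent namespaces: every lookup starts from a per-namespace
 *    root, so two containers can each own, say, 10.0.0.1 without any
 *    conflict -- which is what migration needs. */
struct route_table;			/* private per namespace, opaque here */
struct net_device_list;			/* private per namespace, opaque here */

struct net_namespace {
	struct route_table *routes;
	struct net_device_list *devices;
};

int main(void)
{
	const uint32_t allowed[] = { 0x0a000001 };	/* 10.0.0.1 */
	struct context ctx = { allowed, 1 };

	printf("bind 10.0.0.1: %s\n", bind_allowed(&ctx, 0x0a000001) ? "ok" : "denied");
	printf("bind 10.0.0.2: %s\n", bind_allowed(&ctx, 0x0a000002) ? "ok" : "denied");
	return 0;
}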

  2) the socket level bind/accept filtering with multiple IPs. From
  reading what Herbert has, it seems they have figured a clever way to
  optimize this path albeit some challenges (speacial casing for raw
  filters) etc.
 
  I am wondering if one was to use the two level muxing of the socket
  layer, how much more performance improvement the above scheme provides
  for #2? 
 
 I don't follow this question. 

 if you had the sockets tables being in two level mux, first level to
 hash on namespace which leads to an indirection pointer to the table
 to find the socket and its bindings (with zero code changes to the
 socket code), then isnt this fast enough? Clearly you can optimize as
 in the case of bind/accept filtering, but then you may have to do that
 for every socket family/protocol (eg netlink doesnt have IP addresses,
 but the binding to multiple groups is possible)

 Am i making any more sense? ;-

Yes.  As far as I can tell this is what we are doing and generally
it doesn't even require a hash to get the namespace.  Just an appropriate
place to look for the pointer to the namespace structure.

The practical problem with socket lookup is that it is a hash table today;
allocating the top level of that hash table dynamically at run-time looks
problematic, as it is more than a single page.
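
A rough userspace sketch of the alternative mentioned above - keep a single
global hash table, but fold the namespace pointer into the hash key and the
lookup comparison instead of allocating a multi-page table per namespace.
The names are invented for this sketch, not the real kernel structures:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct net_ns { int id; };		/* opaque namespace handle */

struct sock_entry {
	struct sock_entry *next;
	struct net_ns *ns;		/* owning namespace, part of the key */
	uint16_t port;
};

#define HASH_BITS 8
#define HASH_SIZE (1 << HASH_BITS)

static struct sock_entry *sock_hash[HASH_SIZE];

static unsigned int sock_hashfn(const struct net_ns *ns, uint16_t port)
{
	/* Fold the namespace pointer into the bucket index. */
	return (unsigned int)((((uintptr_t)ns) >> 4) ^ port) & (HASH_SIZE - 1);
}

static void sock_insert(struct sock_entry *e)
{
	unsigned int h = sock_hashfn(e->ns, e->port);

	e->next = sock_hash[h];
	sock_hash[h] = e;
}

static struct sock_entry *sock_lookup(struct net_ns *ns, uint16_t port)
{
	struct sock_entry *e = sock_hash[sock_hashfn(ns, port)];

	for (; e; e = e->next)
		if (e->ns == ns && e->port == port)	/* same port in another ns won't match */
			return e;
	return NULL;
}

int main(void)
{
	struct net_ns ns_a = { 1 }, ns_b = { 2 };
	struct sock_entry a = { NULL, &ns_a, 80 };

	sock_insert(&a);
	printf("ns_a:80 -> %s\n", sock_lookup(&ns_a, 80) ? "found" : "miss");
	printf("ns_b:80 -> %s\n", sock_lookup(&ns_b, 80) ? "found" : "miss");
	return 0;
}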

  Consider the case of L2 where by the time the packet hits the socket
  layer on incoming, the VE is already known; in such a case, the lookup
  would be very cheap. The advantage being you get rid of the speacial
  casing altogether. I dont see any issues with binds per multiple IPs etc
  using such a technique.
 
  For the case of #1 above, wouldnt it be also easier if the tables for
  netdevices, PIDs etc were per VE (using the 2 level mux)?
 
 Generally yes.  s/VE/namespace/.  There is a case with hash tables where
 it seems saner to add an additional entry because hash it is hard to
 dynamically
 allocate a hash table, (because they need something large then a
 single page allocation).  

 A page to store the namespace indirection hash doesnt seem to be such a
 big waste; i wonder though why you even need a page. If i had 256 hash
 buckets with 1024 namespaces, it is still not too much of an overhead.

Not for namespaces, the problem is for existing hash tables, like the
ipv4 routing cache, and for the sockets...

 But for everything else yes it makes things
 much easier if you have a per namespace data structure.  

 Ok, I am sure youve done the research; i am just being a devils
 advocate.

I don't think we have gone far enough to prove what has good performance.

 A practical question is can we replace hash tables with some variant of
 trie or radix-tree and not take a performance hit.  Given the better scaling
 of
 tress to different workload sizes if we can use them so much the
 better.  Especially because a per namespace split gives us a lot of
 good properties.

 Is there a patch somewhere i can stare at that you guys agree on?

For non networking stuff you can look at the uts and ipc namespaces
that have been merged into 2.6.19.  There is also the struct pid work
that is a lead up to the pid namespace.

We have very carefully broken the problem by subsystem so we can do
incremental steps to get container support into the kernel.

That I don't think is the answer you want; I think you are looking
for networking stack agreement.  If we had that we would be submitting

Re: Network virtualization/isolation

2006-12-04 Thread Eric W. Biederman
Dmitry Mishin [EMAIL PROTECTED] writes:

 On Sunday 03 December 2006 19:00, Eric W. Biederman wrote:
 Ok.  Just a quick summary of where I see the discussion.

 We all agree that L2 isolation is needed at some point.
 As we all agreed on this, may be it is time to send patches one-by-one?
 For the beggining, I propose to resend Cedric's empty namespace patch as base 
 for others - it is really empty, but necessary in order to move further.

 After this patch and the following net namespace unshare patch will be 
 accepted, I could send network devices virtualization patches for review and 
 discussion.

 What do you think?

I think sending out these patches for review sounds great.

For merge order I think enabling the unshare/clone flags to anyone
but developers should be about the last thing we do.

Starting with clone/unshare sounds to me like hitching up the cart
before it is built.

I really need to focus on finishing up the pid namespace, so except
for a little review and conversation I'm not going to help much on the
network side.

Of course I need to mess with unix domain sockets to properly
implement the pid namespace.  Because of the pid credential passing.

Eric



Re: Network virtualization/isolation

2006-12-04 Thread Dmitry Mishin
On Monday 04 December 2006 18:35, Eric W. Biederman wrote:
[skip]
 Where and when you look to find the network namespace that applies to
 a packet is the primary difference between the OpenVZ L2
 implementation and my L2 implementation.

 If there is a better and less intrusive while still being obvious
 method I am all for it.  I do not like the OpenVZ thing of doing the
 lookup once and then stashing the value in current and the special
 casing the exceptions.
Why?

-- 
Thanks,
Dmitry.


Re: Network virtualization/isolation

2006-12-04 Thread Herbert Poetzl
On Mon, Dec 04, 2006 at 06:19:00PM +0300, Dmitry Mishin wrote:
 On Sunday 03 December 2006 19:00, Eric W. Biederman wrote:
  Ok.  Just a quick summary of where I see the discussion.
 
  We all agree that L2 isolation is needed at some point.

 As we all agreed on this, may be it is time to send patches
 one-by-one? For the beggining, I propose to resend Cedric's 
 empty namespace patch as base for others - it is really empty, 
 but necessary in order to move further.
 
 After this patch and the following net namespace unshare 
 patch will be accepted, 

well, I have not yet seen any performance tests showing
that the following is true:

 - no change in network performance without the 
   space enabled
 - no change in network performance on the host
   with the network namespaces enabled
 - no measurable overhead inside the network
   namespace 
 - good scalability for a larger number of network
   namespaces 

 I could send network devices virtualization patches for
 review and discussion.

that won't hurt ...

best,
Herbert

 What do you think?
 
  The approaches discussed for L2 and L3 are sufficiently orthogonal
  that we can implement then in either order.  You would need to
  unshare L3 to unshare L2, but if we think of them as two separate
  namespaces we are likely to be in better shape.
 
  The L3 discussion still has the problem that there has not been
  agreement on all of the semantics yet.
 
  More comments after I get some sleep.
 
  Eric
 
 -- 
 Thanks,
 Dmitry.


Re: Network virtualization/isolation

2006-12-04 Thread Eric W. Biederman
Dmitry Mishin [EMAIL PROTECTED] writes:

 On Monday 04 December 2006 18:35, Eric W. Biederman wrote:
 [skip]
 Where and when you look to find the network namespace that applies to
 a packet is the primary difference between the OpenVZ L2
 implementation and my L2 implementation.

 If there is a better and less intrusive while still being obvious
 method I am all for it.  I do not like the OpenVZ thing of doing the
 lookup once and then stashing the value in current and the special
 casing the exceptions.
 Why?

I like it when things are obvious and not implied.

The implementations seem to favor fewer lines of code touched over
maintainability of the code.  Which if you are maintaining out-of-tree
code is fine.  At least that was my impression the last time
I looked at the code.

I know there are a lot of silly things in the existing implementations
because they were initially written without the expectation of being
able to merge the code into the main kernel.  This resulted in some
non-general interfaces, and a preference for patches that touch
as few lines of code as possible.  

Anyway this bit has been discussed before and we can discuss it
seriously in the context of patch review.

Eric


Re: Network virtualization/isolation

2006-12-04 Thread Eric W. Biederman
Herbert Poetzl [EMAIL PROTECTED] writes:

 On Mon, Dec 04, 2006 at 06:19:00PM +0300, Dmitry Mishin wrote:
 On Sunday 03 December 2006 19:00, Eric W. Biederman wrote:
  Ok.  Just a quick summary of where I see the discussion.
 
  We all agree that L2 isolation is needed at some point.

 As we all agreed on this, may be it is time to send patches
 one-by-one? For the beggining, I propose to resend Cedric's 
 empty namespace patch as base for others - it is really empty, 
 but necessary in order to move further.
 
 After this patch and the following net namespace unshare 
 patch will be accepted, 

 well, I have neither seen any performance tests showing
 that the following is true:

  - no change on network performance without the 
space enabled
  - no change on network performance on the host
with the network namespaces enabled
  - no measureable overhead inside the network
namespace 
  - good scaleability for a larger number of network
namespaces 

Yes all important criteria for selecting the implementation.

Eric


Re: Network virtualization/isolation

2006-12-04 Thread Dmitry Mishin
On Monday 04 December 2006 19:43, Herbert Poetzl wrote:
 On Mon, Dec 04, 2006 at 06:19:00PM +0300, Dmitry Mishin wrote:
  On Sunday 03 December 2006 19:00, Eric W. Biederman wrote:
   Ok.  Just a quick summary of where I see the discussion.
  
   We all agree that L2 isolation is needed at some point.
 
  As we all agreed on this, may be it is time to send patches
  one-by-one? For the beggining, I propose to resend Cedric's
  empty namespace patch as base for others - it is really empty,
  but necessary in order to move further.
 
  After this patch and the following net namespace unshare
  patch will be accepted,

 well, I have neither seen any performance tests showing
 that the following is true:

  - no change on network performance without the
space enabled
  - no change on network performance on the host
with the network namespaces enabled
  - no measureable overhead inside the network
namespace
  - good scaleability for a larger number of network
namespaces
These questions are for a complete L2 implementation, not for these 2 empty 
patches. If you need some data relating to Andrey's implementation, I'll get 
it. Which tests would you accept?
 

  I could send network devices virtualization patches for
  review and discussion.

 that won't hurt ...

 best,
 Herbert

  What do you think?
 
   The approaches discussed for L2 and L3 are sufficiently orthogonal
   that we can implement then in either order.  You would need to
   unshare L3 to unshare L2, but if we think of them as two separate
   namespaces we are likely to be in better shape.
  
   The L3 discussion still has the problem that there has not been
   agreement on all of the semantics yet.
  
   More comments after I get some sleep.
  
   Eric
 
  --
  Thanks,
  Dmitry.

-- 
Thanks,
Dmitry.


Re: Network virtualization/isolation

2006-12-04 Thread Daniel Lezcano

Dmitry Mishin wrote:

On Monday 04 December 2006 19:43, Herbert Poetzl wrote:

On Mon, Dec 04, 2006 at 06:19:00PM +0300, Dmitry Mishin wrote:

On Sunday 03 December 2006 19:00, Eric W. Biederman wrote:

Ok.  Just a quick summary of where I see the discussion.

We all agree that L2 isolation is needed at some point.

As we all agreed on this, may be it is time to send patches
one-by-one? For the beggining, I propose to resend Cedric's
empty namespace patch as base for others - it is really empty,
but necessary in order to move further.

After this patch and the following net namespace unshare
patch will be accepted,

well, I have neither seen any performance tests showing
that the following is true:

 - no change on network performance without the
   space enabled
 - no change on network performance on the host
   with the network namespaces enabled
 - no measureable overhead inside the network
   namespace
 - good scaleability for a larger number of network
   namespaces
These questions are for complete L2 implementation, not for these 2 empty 
patches. If you need some data relating to Andrey's implementation, I'll get 
it. Which test do you accept?


tbench ?

With the following scenarios:

 * intra-host communication (once with the IP on eth and once with 
127.0.0.1)

 * inter-host communication

Each time:
 - with a single network namespace
 - with 100 network namespaces: 1 server communicating and 99 listening 
   but doing nothing.





Re: Network virtualization/isolation

2006-12-04 Thread Herbert Poetzl
On Mon, Dec 04, 2006 at 08:02:48PM +0300, Dmitry Mishin wrote:
 On Monday 04 December 2006 19:43, Herbert Poetzl wrote:
  On Mon, Dec 04, 2006 at 06:19:00PM +0300, Dmitry Mishin wrote:
   On Sunday 03 December 2006 19:00, Eric W. Biederman wrote:
Ok.  Just a quick summary of where I see the discussion.
   
We all agree that L2 isolation is needed at some point.
  
   As we all agreed on this, may be it is time to send patches
   one-by-one? For the beggining, I propose to resend Cedric's
   empty namespace patch as base for others - it is really empty,
   but necessary in order to move further.
  
   After this patch and the following net namespace unshare
   patch will be accepted,
 
  well, I have neither seen any performance tests showing
  that the following is true:
 
   - no change on network performance without the
 space enabled
   - no change on network performance on the host
 with the network namespaces enabled
   - no measureable overhead inside the network
 namespace
   - good scaleability for a larger number of network
 namespaces

 These questions are for complete L2 implementation, 
 not for these 2 empty patches. 

well, I fear that we will have a lot of overhead
'sneaking' in via small patches (each with almost
unnoticeable overhead) making the 2.6 branch slower
and slower (regarding networking), so IMHO a complete
solution should be drafted and tested performance-wise;
we can then adjust it and possibly improve
it, until it shows no measurable overhead ...

but IMHO it should be developed 'outside' the kernel,
in small and reviewable pieces which are constantly
updated to match the recent kernels ... something
like stacked git or quilt ...

 If you need some data relating to Andrey's
 implementation, I'll get it. Which test do you accept?

hmm, I think a good mix of netperf, netpipe and
iperf would be a good start, probably network folks
know better tests to exercise the stack ... at least
I hope so ...

of course, a good explanation _why_ this or that
code path does not add overhead here or there is
nice to have too ...

best,
Herbert

   I could send network devices virtualization patches for
   review and discussion.
 
  that won't hurt ...
 
  best,
  Herbert
 
   What do you think?
  
The approaches discussed for L2 and L3 are sufficiently orthogonal
that we can implement then in either order.  You would need to
unshare L3 to unshare L2, but if we think of them as two separate
namespaces we are likely to be in better shape.
   
The L3 discussion still has the problem that there has not been
agreement on all of the semantics yet.
   
More comments after I get some sleep.
   
Eric
  
   --
   Thanks,
   Dmitry.
 
 -- 
 Thanks,
 Dmitry.


Re: Network virtualization/isolation

2006-12-03 Thread jamal
On Wed, 2006-14-11 at 16:17 +0100, Daniel Lezcano wrote:
 The attached document describes the network isolation at the layer 2
 and at the layer 3 ..

Daniel,

I apologize for taking this long to get back to you. The document (I
hope) made clear to me at least the difference between the two
approaches. So thanks for taking the time to put it together.

So here are my thoughts ...
I havent read the rest of the thread so i may be repeating some of the
discussion; i have time today, so I will try to catch up with the
discussion.

* I think the L2 approach is the more complete of the two approaches:

It caters to more applications: e.g. I can have network elements such as
virtual bridges and routers. It doesn't seem like I can do that with the
L3 approach. I think this in itself is a powerful enough reason to
disqualify the L3 approach.

Leading on from the above, I don't have to make _a single line of code
change_ to any of the network element management tools inside the
container, i.e. I can just run quagga, and OSPF and BGP will work as is, or
the bridge daemon, and STP will work as is, or tc to control real
devices or ip to control real IP addresses. Virtual routers and
bridges are real-world applications (if you want more info ask me or ask
google, she knows).

 This wasn't clear to me from the doc on the L3 side of things, so
please correct me: 
because of the pid virtualization in the L2 approach (openvz?) I can run
all applications as is. They just don't know they are running in a
virtual environment. To use an extreme example: if I picked an apache
binary compiled 10 years ago, it will run with the L2 approach but not with
the L3 approach. Is this understanding correct?  I find it hard to
believe that the L3 approach wouldn't work this way - it may be just my
reading of the doc.

So let's say the approach taken is that of L2 (I am biased towards this
because I want to be able to do virtual bridges and routers). The
disadvantage of the L2 (or is it just the openvz?) approach is:

- it is clear there's a lot more code needed to allow for the two-level
multiplexing everywhere, i.e. first you mux to select the namespace, then
you do other things like find a pid, netdevice, IP address etc. I am
also not sure how complete that code is; you clearly get everything
attached to netdevices for free (e.g. the network scheduler block) - which is
nice in itself; but you may have to do the muxing code for other blocks.
If my understanding is correct, everything in the net subsystem has these
mux levels already in place (at least with openvz)? I think each
subsystem may have its own merits discussed (e.g. the L3 tables with the
recent changes from Patrick allow up to 2^32 - 1 tables, so a muxing
layer at L3 may be unnecessary)
--- To me this two-level muxing looks like a clean design in that there
is consistency (i.e. no hack that's specific to just one sub-subsystem but
not others). With this approach one could imagine hardware support that
does the first level of muxing (selecting a namespace for you). This is
clearly already happening with NICs supporting several unicast MAC
addresses. 
I think the litmus test for this approach is the answer to the question:
if I compile the containers in and do not use the namespaces, how
much more overhead is there for the host path? I would hope that it is
as close to 0 as possible. It should certainly be 0 if I don't compile in
containers.

- The desire for many MAC addresses. I don't think this is a killer
issue. NICs are beginning to show up with capabilities for many unicast
MACs; many current ones have multicast hardware tables that can be used for
stashing unicast MAC addresses; it has also been shown you can use
multicast MAC addresses and get away with it if there is no conflict
(protocols such as VRRP, CARP etc. do this).

- Manageability from the host side. It seems to be more complex with the
L2 than with L3. But so what? These tools are written from scratch and
there is no backward compatibility baggage.

Ok, I am out of coffee for the last 10 minutes;- But above sit my views
worth about $0.02 Canadian (which is about $0.02 US these days).

I will try later to catch up with the discussion that started from
Daniels original posting.

cheers,
jamal

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Network virtualization/isolation

2006-12-03 Thread jamal
I have removed the Re: just to add some freshness to the discussion

So I quickly read the rest of the discussions. I was almost surprised to
find that I agree with Eric on a lot of opinions (we also agree that
vindaloo is good for you, I guess);-
The two issues that stood out for me (in addition to what I already said
below):

1) the solution must ease the migration of containers; I didn't see
anything about migrating them to another host across a network, but I
assume that this is a given.
2) the socket-level bind/accept filtering with multiple IPs. From
reading what Herbert has, it seems they have figured out a clever way to
optimize this path, albeit with some challenges (special-casing for raw
filters) etc.

I am wondering, if one were to use the two-level muxing at the socket
layer, how much more performance improvement the above scheme provides
for #2. 
Consider the case of L2 where, by the time an incoming packet hits the socket
layer, the VE is already known; in such a case, the lookup
would be very cheap. The advantage being you get rid of the special
casing altogether. I don't see any issues with binds to multiple IPs etc.
using such a technique.

For the case of #1 above, wouldn't it also be easier if the tables for
netdevices, PIDs etc. were per VE (using the two-level mux)?
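
As a purely illustrative user-space sketch of that two-level mux idea
(all names invented here): level one maps the receiving device to its
VE/namespace, level two is the ordinary lookup, scoped to that
namespace's own tables:

/* illustrative user-space model of two-level muxing; all names invented */
#include <stdint.h>
#include <stddef.h>

#define HASH_SZ 1024

struct sock_entry {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    struct sock_entry *next;
};

struct net_namespace {                   /* one per VE */
    struct sock_entry *sock_hash[HASH_SZ];
    /* per-VE netdevice list, routing tables, PID table, ... */
};

struct net_device {
    char name[16];
    struct net_namespace *ns;            /* level 1: device -> namespace */
};

static unsigned hash4(uint32_t s, uint16_t sp, uint32_t d, uint16_t dp)
{
    return (s ^ d ^ ((uint32_t)sp << 16 | dp)) % HASH_SZ;
}

/* level 2: the usual lookup, scoped to one namespace's tables */
static struct sock_entry *ve_lookup(struct net_namespace *ns,
                                    uint32_t s, uint16_t sp,
                                    uint32_t d, uint16_t dp)
{
    struct sock_entry *e = ns->sock_hash[hash4(s, sp, d, dp)];
    while (e && !(e->saddr == s && e->sport == sp &&
                  e->daddr == d && e->dport == dp))
        e = e->next;
    return e;                            /* cheap: the VE is already selected */
}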


In any case, folks, I hope I am not treading on anyone's toes; I know
each one of you has an implementation and users, and I am trying to be as 
neutral as I can (but clearly biased ;-).

cheers,
jamal



On Sun, 2006-03-12 at 07:26 -0500, jamal wrote:
 On Wed, 2006-14-11 at 16:17 +0100, Daniel Lezcano wrote:
  The attached document describes the network isolation at the layer 2
  and at the layer 3 ..
 
 Daniel,
 
 I apologize for taking this long to get back to you. The document (I
 hope) made it clear to me at least the difference between the two
 approaches. So thanks for taking the time to put it together.
 
 So here are my thoughts ...
 I havent read the rest of the thread so i may be repeating some of the
 discussion; i have time today, I will try to catchup with the
 discussion.
 
 * i think the L2 approach is the more complete of the two approaches:
 
 It caters to more applications: eg i can have network elements such as
 virtual bridges and routers. It doesnt seem like i can do that with the
 L3 approach. I think this in itself is a powerful enough reason to
 disqualify the L3 approach.
 
 Leading from the above, I dont have to make _a single line of code
 change_ to any of the network element management tools inside the
 container. i.e i can just run quagga and OSPF and BGP will work as is or
 the bridge daemon and STP will work as is or tc to control real
 devices or ip to control real ip addresses. Virtual routers and
 bridges are real world applications (if you want more info ask me or ask
 google, she knows).
 
  This wasnt clear to me from the doc on the L3 side of things, so
 please correct me: 
 because of the pid virtualization in the L2 approach(openvz?) I can run
 all applications as is. They just dont know they are running on a
 virtual environment. To use an extreme example: if i picked apache as a
 binary compiled 10 years ago, it will run on the L2 approach but not on
 the L3 approach. Is this understanding correct?  I find it hard to
 believe that the L3 approach wouldnt work this way - it may be just my
 reading into the doc.
 
 So lets say the approach taken is that of L2 (I am biased towards this
 because i want to be able to do virtual bridges and routers). The
 disadvantage of the L2 approach (or is it just the openvz?) approach is:
 
 - it is clear theres a lot more code needed to allow for the two level
 multiplexing every where. i.e first you mux to select the namespace then
 you do other things like find a pid, netdevice, ip address etc. I am
 also not sure how complete that code is; you clearly get everything
 attached to netdevices for free (eg networkc scheduler block) - which is
 nice in itself; but you may have to do the muxing code for other blocks.
 If my understanding is correct everything in the net subsystem has this
 mux levels already in place (at least with openvz)? I think each
 subsystem may have its own merit discussed (eg the L3 tables with the
 recent changes from Patrick allow up to 2^32 -1 tables, so a muxing
 layer at L3 maybe unnecessary)
 --- To me this 2 level muxing looks like a clean design in that there
 is consistency (i.e no hack thats specific to just one sub-subsystem but
 not others). With this approach one could imagine hardware support that
 does the first level of muxing (selecting a namespace for you). This is
 clearly already happening with NICs supporting several unicast MAC
 addresses. 
 I think the litmus test for this approach is the answer to the question:
 If i compiled in the containers in and do not use the namespaces, how
 much more overhead is there for the host path? I would hope that it is
 as close to 0 as possible. It should certainly be 0 if i dont 

Re: Network virtualization/isolation

2006-12-03 Thread Eric W. Biederman

Ok.  Just a quick summary of where I see the discussion.

We all agree that L2 isolation is needed at some point.

The approaches discussed for L2 and L3 are sufficiently orthogonal
that we can implement them in either order.  You would need to
unshare L3 to unshare L2, but if we think of them as two separate
namespaces we are likely to be in better shape.

The L3 discussion still has the problem that there has not been
agreement on all of the semantics yet.

More comments after I get some sleep.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network virtualization/isolation

2006-12-03 Thread Herbert Poetzl
On Sun, Dec 03, 2006 at 07:26:02AM -0500, jamal wrote:
 On Wed, 2006-14-11 at 16:17 +0100, Daniel Lezcano wrote:
  The attached document describes the network isolation at the layer 2
  and at the layer 3 ..
 
 Daniel,
 
 I apologize for taking this long to get back to you. The document (I
 hope) made it clear to me at least the difference between the two
 approaches. So thanks for taking the time to put it together.
 
 So here are my thoughts ...
 I havent read the rest of the thread so i may be repeating some of the
 discussion; i have time today, I will try to catchup with the
 discussion.
 
 * i think the L2 approach is the more complete of the two approaches:
 
 It caters to more applications: eg i can have network elements such as
 virtual bridges and routers. It doesnt seem like i can do that with the
 L3 approach. I think this in itself is a powerful enough reason to
 disqualify the L3 approach.
 
 Leading from the above, I dont have to make _a single line of code
 change_ to any of the network element management tools inside the
 container. i.e i can just run quagga and OSPF and BGP will work as is or
 the bridge daemon and STP will work as is or tc to control real
 devices or ip to control real ip addresses. Virtual routers and
 bridges are real world applications (if you want more info ask me or ask
 google, she knows).
 
  This wasnt clear to me from the doc on the L3 side of things, so
 please correct me: 
 because of the pid virtualization in the L2 approach(openvz?) I can run
 all applications as is. They just dont know they are running on a
 virtual environment. To use an extreme example: if i picked apache as a
 binary compiled 10 years ago, it will run on the L2 approach but not on
 the L3 approach. Is this understanding correct?  I find it hard to
 believe that the L3 approach wouldnt work this way - it may be just my
 reading into the doc.

the 10 year old apache will run with layer 3 isolation
as well as with layer 2 virtualization (probably a little
faster though, we do not know yet :), because what it 
does is IP (layer 3) traffic ...

 So lets say the approach taken is that of L2 (I am biased towards this
 because i want to be able to do virtual bridges and routers). 
 The disadvantage of the L2 approach (or is it just the openvz?)
 approach is:
 
 - it is clear theres a lot more code needed to allow for the two level
 multiplexing every where. i.e first you mux to select the namespace then
 you do other things like find a pid, netdevice, ip address etc. I am
 also not sure how complete that code is; you clearly get everything
 attached to netdevices for free (eg networkc scheduler block) - which is
 nice in itself; but you may have to do the muxing code for other blocks.
 If my understanding is correct everything in the net subsystem has this
 mux levels already in place (at least with openvz)? I think each
 subsystem may have its own merit discussed (eg the L3 tables with the
 recent changes from Patrick allow up to 2^32 -1 tables, so a muxing
 layer at L3 maybe unnecessary)

 --- To me this 2 level muxing looks like a clean design in that there
 is consistency (i.e no hack thats specific to just one sub-subsystem but
 not others). With this approach one could imagine hardware support that
 does the first level of muxing (selecting a namespace for you). This is
 clearly already happening with NICs supporting several unicast MAC
 addresses. 

 I think the litmus test for this approach is the answer to the question:
 If i compiled in the containers in and do not use the namespaces, how
 much more overhead is there for the host path? I would hope that it is
 as close to 0 as possible. It should certainly be 0 if i dont compile in
 containers.

IMHO there are three cases to consider, to get valid
'performance' numbers:

 - host system with and without containers enabled
 - single guest (container) compared to host system _without_
 - bunch of guests (e.g. 10) compared to 10 apps/threads on host

one proven feature of the L3 isolation is that those
all end up with the same or even better performance

 - The desire for many MAC addresses. I dont think this is a killer
 issue. NICs are begining to show up which capabilities for many unicast
 MACs; many current have multicast hardware tables that can be used for
 stashing unicast MAC addresses; it has also been shown you can use
 multicast MAC addresses and get away with it if there is no conflict
 (protocols such as VRRP, CARP etc do this).
 
 - Manageability from the host side. It seems to be more complex with the
 L2 than with L3. But so what? These tools are written from scratch and
 there is no backward compatibility baggage.

well, no, actually the 'tools' to manage layer 3 isolation
are already there, and except for the 'setup' there is
nothing special to configure, as networking still lives
on the host

 Ok, I am out of coffee for the last 10 minutes;- But above sit my views
 worth about $0.02 Canadian (which is about $0.02 

Re: Network virtualization/isolation

2006-12-03 Thread jamal
On Sun, 2006-03-12 at 17:37 +0100, Herbert Poetzl wrote:
 On Sun, Dec 03, 2006 at 07:26:02AM -0500, jamal wrote:

 To use an extreme example: if i picked apache as a
  binary compiled 10 years ago, it will run on the L2 approach but not on
  the L3 approach. Is this understanding correct?  I find it hard to
  believe that the L3 approach wouldnt work this way - it may be just my
  reading into the doc.
 
 the 10 year old apache will run with layer 3 isolation
 as well as with layer 2 virtualization (probably a little
 faster though, we do not know yet :), because what it 
 does is IP (layer 3) traffic ...
 

Ok, thanks for clarifying this.

  I think the litmus test for this approach is the answer to the question:
  If i compiled in the containers in and do not use the namespaces, how
  much more overhead is there for the host path? I would hope that it is
  as close to 0 as possible. It should certainly be 0 if i dont compile in
  containers.
 
 IMHO there are three cases to consider, to get valid
 'performance' numbers:
 
  - host system with and without containers enabled
  - single guest (container) compared to host system _without_

Sound reasonable.

  - bunch of guests (e.g. 10) compared to 10 apps/threads on host
 

Your mileage may vary. For me, trying to run virtual routers, this is
not an important test. I want to be able to have containers each running
quagga and OSPF. I can't achieve my goals with 10 quaggas without
making some major changes to quagga.
 
 one proven feature of the L3 isolation is that those
 all end up with the same or even better performance

I think it is valuable to reduce the overhead. I think that it may be
reasonable, up to some threshold, to trade a little performance for
genericity. What the threshold is, I don't know.

  - Manageability from the host side. It seems to be more complex with the
  L2 than with L3. But so what? These tools are written from scratch and
  there is no backward compatibility baggage.
 
 well, no, actually the 'tools' to manage layer 3 isolation
 are already there, 
 and except for the 'setup' there is
 nothing special to configure, as networking still lives
 on the host
 

I don't see the two as being separate issues. You must create containers;
you must configure networking on them; it is forgivable for the
second part of that process to involve some non-standard tools for the
containers (from the host).
It is not forgivable to have specialized tools within the container.

 I would be interested in a config layout for a typical
 L3 isolation setup when you 'only' have L2 virtualization
 
  - typical host system with apache, mysql, postfix, ssh
and ftp is broken down into security contexts to
allow for increased security
  - as part of that process, the services are isolated,
while apache and ftp share the same ip [ip0], mysql
will be using a local one [ip1], and postfix/ssh a 
second public one [ip2]
 
 the L3 isolation approach is straight forward:
 
  - assign the two public ips to eth0, the local one
to lo or dummy0
  - create five isolation areas where 0 and 1 share ip0, 
2 uses ip1 and 3,4 uses ip2
 
 that's it, all will work as expected ... let's see with
 what L2 isolation example you come up with, which is 
 able to 'mimic' this setup ...
 
 note: no question it is possible to do that with L2

Unless I am misreading, isn't this merely a matter of configuring,
on the container-side eth0 (I think you are talking about the VE-side eth0
in your example above), two public IP addresses (or even two ethx
devices) and then attaching IP addresses to them? mysql gets an lo address.
Would this not work?

Out of curiosity: assume we have a local LAN (perhaps something upstream
does NAT), is it possible to have the same IP address going to multiple
containers?

cheers,
jamal



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network virtualization/isolation

2006-12-02 Thread Kari Hurtta
[EMAIL PROTECTED] (Eric W. Biederman) writes in gmane.linux.network:

 Ok.  So on this point we agree.  Full isolation at the network device/L2 level
 is desirable and no one is opposed to that.
 
 There is however a strong feeling especially for the case of application
 containers that something more focused on what a non-privileged process can
 use and deal with would be nice.  The ``L3'' case.
 
 I agree that has potential but I worry about 2 things.
 - Premature optimization.
 - A poor choice of semantics.
 - Feature creep leading to insane semantics.
 
 I feel there is something in the L3 arguments as well and it sounds
 like it would be a good idea to flush out the semantics.
 
 For full network isolation we have the case that every process,
 every socket, and every network device belongs to a network namespace.
 This is enough to derive the network namespace for all other user
 visible data structures, and to a large extent to define their semantics.
 
 We still need a definition of the non-privileged case, that is compatible
 with the former definition.
 
 .
 
 What unprivileged user space gets to manipulate are sockets.  So perhaps
 we can break our model into a network socket namespace and network device
 namespace.  
 
 I would define it so that for each socket there is exactly one network
 socket namespace.  And for each network socket namespace there is exactly
 one network device namespace.
 
 The network socket namespace would be concerned with the rules for deciding
 which local addresses a socket can connect/accept/bind to.
 
 The network device namespace would be concerned with everything else.

One thing needs to be decided:  what is the connection between namespaces?

- The connection between network device namespaces is a bridge.

- What (which socket) is the connection between network socket namespaces?

The connection inside one namespace is clear, I think:

 - The connection inside a network device namespace is the loopback device.

 - The connection inside a network socket namespace is a socket
   using loopback address(es)?
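
(A hypothetical sketch of the shape described in the quoted model, with
invented names; nothing here mirrors real kernel structures. Each socket
belongs to exactly one socket namespace, and each socket namespace points
at exactly one device namespace:)

/* hypothetical shape only -- illustrates the 1:1 relationships above */
struct net_dev_ns;                  /* devices, routes, neighbour tables, ... */

struct net_sock_ns {
    struct net_dev_ns *dev_ns;      /* exactly one device namespace */
    /* rules deciding which local addresses may be bound/accepted/connected */
};

struct my_socket {
    struct net_sock_ns *sock_ns;    /* exactly one socket namespace */
    /* protocol state ... */
};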

/ Kari Hurtta



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network virtualization/isolation

2006-12-02 Thread Kari Hurtta
Kari Hurtta [EMAIL PROTECTED] writes in gmane.linux.network:

 [EMAIL PROTECTED] (Eric W. Biederman) writes in gmane.linux.network:
 
  Ok.  So on this point we agree.  Full isolation at the network device/L2 
  level
  is desirable and no one is opposed to that.
  
  There is however a strong feeling especially for the case of application
  containers that something more focused on what a non-privileged process can
  use and deal with would be nice.  The ``L3'' case.
  
  I agree that has potential but I worry about 2 things.
  - Premature optimization.
  - A poor choice of semantics.
  - Feature creep leading to insane semantics.
  
  I feel there is something in the L3 arguments as well and it sounds
  like it would be a good idea to flush out the semantics.
  
  For full network isolation we have the case that every process,
  every socket, and every network device belongs to a network namespace.
  This is enough to derive the network namespace for all other user
  visible data structures, and to a large extent to define their semantics.
  
  We still need a definition of the non-privileged case, that is compatible
  with the former definition.
  
  .
  
  What unprivileged user space gets to manipulate are sockets.  So perhaps
  we can break our model into a network socket namespace and network device
  namespace.  
  
  I would define it so that for each socket there is exactly one network
  socket namespace.  And for each network socket namespace there is exactly
  one network device namespace.
  
  The network socket namespace would be concerned with the rules for deciding
  which local addresses a socket can connect/accept/bind to.
  
  The network device namespace would be concerned with everything else.
 
 There need decide one thing:  What is connection between  namespaces?
 
 - Connection between the network device namespaces is bridge.
 
 - What (socket) is connection between the network socket namespaces?
 
 Connection inside on name namespace is clear I think.
 
  - Connection inside of network device namespaces is loopback device.
 
  - Connection inside of network socket namespaces is socket
using loopback address(es)?


On the other hand, the bridge between the network device namespaces needs to
be set up in the parent namespace.

So it is logical that the connecting socket between network socket namespaces
is set up in the parent namespace. So the connecting socket is any socket
created in the parent namespace?   (socketpair(), for example?)


(Currently socketpair() only allows creating AF_UNIX sockets, which
 are not part of a network socket namespace, I think)

/ Kari Hurtta

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: Network virtualization/isolation

2006-11-30 Thread Vlad Yasevich
Daniel Lezcano wrote:
 Brian Haley wrote:
 Eric W. Biederman wrote:
 I think for cases across network socket namespaces it should
 be a matter for the rules, to decide if the connection should
 happen and what error code to return if the connection does not
 happen.

 There is a potential in this to have an ambiguous case where two
 applications can be listening for connections on the same socket
 on the same port and both will allow the connection.  If that
 is the case I believe the proper definition is the first socket
 that we find that will accept the connection gets the connection.
 No. If you try to connect, the destination IP address is assigned to a
 network namespace. This network namespace is used to leave the listening
 socket ambiguity.

 Wouldn't you want to catch this at bind() and/or configuration time and
 fail?  Having overlapping namespaces/rules seems undesirable, since as
 Herbert said, can get you unexpected behaviour.
 
 Overlapping is not a problem, you can have several sockets binded on the
 same INADDR_ANY/port without ambiguity because the network namespace
 pointer is added as a new key for sockets lookup, (src addr, src port,
 dst addr, dst port, net ns pointer). The bind should not be forced to a
 specific address because you will not be able to connect via 127.0.0.1.

So, all this leads me to ask: how do we handle 127.0.0.1?

For L2 it seems easy.  Each namespace gets a tagged lo device.
How do you propose to do it for L3, because disabling access to loopback is
not a valid option, IMO.

I agree that adding a namespace to the (using generic terms) TCB lookup 
solves the conflict issue.

-vlad
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: Network virtualization/isolation

2006-11-30 Thread Daniel Lezcano

Vlad Yasevich wrote:

Daniel Lezcano wrote:

Brian Haley wrote:

Eric W. Biederman wrote:

I think for cases across network socket namespaces it should
be a matter for the rules, to decide if the connection should
happen and what error code to return if the connection does not
happen.

There is a potential in this to have an ambiguous case where two
applications can be listening for connections on the same socket
on the same port and both will allow the connection.  If that
is the case I believe the proper definition is the first socket
that we find that will accept the connection gets the connection.

No. If you try to connect, the destination IP address is assigned to a
network namespace. This network namespace is used to leave the listening
socket ambiguity.

Wouldn't you want to catch this at bind() and/or configuration time and
fail?  Having overlapping namespaces/rules seems undesirable, since as
Herbert said, can get you unexpected behaviour.

Overlapping is not a problem, you can have several sockets binded on the
same INADDR_ANY/port without ambiguity because the network namespace
pointer is added as a new key for sockets lookup, (src addr, src port,
dst addr, dst port, net ns pointer). The bind should not be forced to a
specific address because you will not be able to connect via 127.0.0.1.


So, all this leads to me ask, how to handle 127.0.0.1?

For L2 it seems easy.  Each namespace gets a tagged lo device.
How do you propose to do it for L3, because disabling access to loopback is
not a valid option, IMO.


There are 2 options:

1 - Dmitry Mishin proposed to use the l2 mechanism and reinstantiate a 
new loopback device. I haven't tested that yet; perhaps there are issues 
with non-127.0.0.1 loopback traffic and route creation, I don't know.


2 - add the pointer of the network namespace that originated the 
packet into the skbuff when the traffic is for 127.0.0.1, so when the 
packet arrives at IP, it has the destination namespace information 
because source == destination. I tested it and it works fine without 
noticeable overhead, and this can be done with very few lines of code.
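
A toy user-space model of option 2 (names invented; the real change would
of course live in the kernel's skbuff/loopback path): the loopback packet
carries the originating namespace, and since source == destination for
127.0.0.1 traffic, the receive side knows which namespace's sockets to
search:

/* illustrative only -- models the idea, not the kernel implementation */
#include <stdint.h>
#include <stdio.h>

struct net_ns { const char *name; };

struct packet {
    uint32_t daddr;                    /* destination IP, host byte order */
    const struct net_ns *orig_ns;      /* tagged only for loopback traffic */
};

static const struct net_ns *dst_ns(const struct packet *p,
                                   const struct net_ns *derived_ns)
{
    if (p->daddr == 0x7f000001u && p->orig_ns)   /* 127.0.0.1 */
        return p->orig_ns;             /* source namespace == destination */
    return derived_ns;                 /* otherwise derived as usual (L2/L3) */
}

int main(void)
{
    struct net_ns host = { "host" }, guest = { "guest" };
    struct packet p = { 0x7f000001u, &guest };
    printf("deliver to namespace: %s\n", dst_ns(&p, &host)->name);
    return 0;
}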


  -- Daniel

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: Network virtualization/isolation

2006-11-30 Thread Herbert Poetzl
On Thu, Nov 30, 2006 at 05:38:16PM +0100, Daniel Lezcano wrote:
 Vlad Yasevich wrote:
  Daniel Lezcano wrote:
  Brian Haley wrote:
  Eric W. Biederman wrote:
  I think for cases across network socket namespaces it should
  be a matter for the rules, to decide if the connection should
  happen and what error code to return if the connection does not
  happen.
 
  There is a potential in this to have an ambiguous case where two
  applications can be listening for connections on the same socket
  on the same port and both will allow the connection.  If that
  is the case I believe the proper definition is the first socket
  that we find that will accept the connection gets the connection.
  No. If you try to connect, the destination IP address is assigned to a
  network namespace. This network namespace is used to leave the listening
  socket ambiguity.
  Wouldn't you want to catch this at bind() and/or configuration time and
  fail?  Having overlapping namespaces/rules seems undesirable, since as
  Herbert said, can get you unexpected behaviour.
  Overlapping is not a problem, you can have several sockets binded on the
  same INADDR_ANY/port without ambiguity because the network namespace
  pointer is added as a new key for sockets lookup, (src addr, src port,
  dst addr, dst port, net ns pointer). The bind should not be forced to a
  specific address because you will not be able to connect via 127.0.0.1.
  
  So, all this leads to me ask, how to handle 127.0.0.1?
  
  For L2 it seems easy.  Each namespace gets a tagged lo device.
  How do you propose to do it for L3, because disabling access to loopback is
  not a valid option, IMO.
 
 There are 2 options:
 
 1 - Dmitry Mishin proposed to use the l2 mechanism and reinstantiate a 
 new loopback device, I didn't tested that yet, perhaps there are issues 
 with non-127.0.0.1 loopback traffic and routes creation, I don't know.
 
 2 - add the pointer of the network namespace who has originated the 
 packet into the skbuff when the traffic is for 127.0.0.1, so when the 
 packet arrive to IP, it has the namespace destination information 
 because source == destination. I tested it and it works fine without 
 noticeable overhead and this can be done with a very few lines of code.

there is a third option, which is a little 'hacky' but
works quite fine too:

 use different loopback addresses for each 'guest' e.g.
 127.x.y.z and 'map' them to 127.0.0.1 (or the other
 way round) whenever appropriate

advantages:
 - doesn't require any skb tagging
 - doesn't change the routing in any way
 - allows isolated loopback connections

disadvantages:
 - blocks those special addresses (127.x.y.z)
 - requires the mapping at bind/receive
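
a purely illustrative sketch of such a mapping (invented helpers,
assuming guest ids start at 2 so the mapped address never collides
with the real 127.0.0.1):

/* illustrative only: per-guest loopback rewriting, 127.0.0.1 <-> 127.x.y.z */
#include <stdint.h>
#include <arpa/inet.h>

#define LO_NET  0x7f000000u            /* 127.0.0.0/8, host byte order */
#define LO_ONE  0x7f000001u            /* 127.0.0.1 */

/* outgoing/bind side: the guest's 127.0.0.1 becomes its private 127.x.y.z */
static uint32_t lo_map_out(uint32_t addr_be, uint32_t guest_id)
{
    if (addr_be == htonl(LO_ONE))
        return htonl(LO_NET | (guest_id & 0x00ffffffu));
    return addr_be;
}

/* receive/lookup side: the private address is shown back as 127.0.0.1 */
static uint32_t lo_map_in(uint32_t addr_be, uint32_t guest_id)
{
    if (addr_be == htonl(LO_NET | (guest_id & 0x00ffffffu)))
        return htonl(LO_ONE);
    return addr_be;
}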

best,
Herbert
 
-- Daniel
 
 ___
 Containers mailing list
 [EMAIL PROTECTED]
 https://lists.osdl.org/mailman/listinfo/containers
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network virtualization/isolation

2006-11-29 Thread Brian Haley

Eric W. Biederman wrote:

I think for cases across network socket namespaces it should
be a matter for the rules, to decide if the connection should
happen and what error code to return if the connection does not
happen.

There is a potential in this to have an ambiguous case where two
applications can be listening for connections on the same socket
on the same port and both will allow the connection.  If that
is the case I believe the proper definition is the first socket
that we find that will accept the connection gets the connection.


Wouldn't you want to catch this at bind() and/or configuration time and 
fail?  Having overlapping namespaces/rules seems undesirable, since as 
Herbert said, can get you unexpected behaviour.



I think with the appropriate set of rules it provides what is needed
for application migration.  I.e. 127.0.0.1 can be filtered so that
you can only connect to sockets in your current container.

It does get a little odd because it does allow for the possibility
that you can have multiple connected sockets with same source ip,
source port, destination ip, destination port.  If the rules are
setup appropriately.  I don't see that peculiarity being visible on
the outside network so it shouldn't be a problem.


So if they're using the same protocol (e.g. TCP), how is it decided which 
one gets an incoming packet?  Maybe I'm missing something as I don't 
understand your inside/outside network reference - is that referring to the 
loopback address comment in the previous paragraph?


Thanks,

-Brian
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: Network virtualization/isolation

2006-11-29 Thread Daniel Lezcano

Brian Haley wrote:

Eric W. Biederman wrote:

I think for cases across network socket namespaces it should
be a matter for the rules, to decide if the connection should
happen and what error code to return if the connection does not
happen.

There is a potential in this to have an ambiguous case where two
applications can be listening for connections on the same socket
on the same port and both will allow the connection.  If that
is the case I believe the proper definition is the first socket
that we find that will accept the connection gets the connection.
No. If you try to connect, the destination IP address is assigned to a 
network namespace. This network namespace is used to resolve the listening 
socket ambiguity.


Wouldn't you want to catch this at bind() and/or configuration time and
fail?  Having overlapping namespaces/rules seems undesirable, since as
Herbert said, can get you unexpected behaviour.


Overlapping is not a problem, you can have several sockets bound to the 
same INADDR_ANY/port without ambiguity because the network namespace 
pointer is added as a new key for the socket lookup: (src addr, src port, 
dst addr, dst port, net ns pointer). The bind should not be forced to a 
specific address, because then you would not be able to connect via 127.0.0.1.





I think with the appropriate set of rules it provides what is needed
for application migration.  I.e. 127.0.0.1 can be filtered so that
you can only connect to sockets in your current container.

It does get a little odd because it does allow for the possibility
that you can have multiple connected sockets with same source ip,
source port, destination ip, destination port.  If the rules are
setup appropriately.  I don't see that peculiarity being visible on
the outside network so it shouldn't be a problem.


So if they're using the same protocol (eg TCP), how is it decided which
one gets an incoming packet?  Maybe I'm missing something as I don't
understand your inside/outside network reference - is that to the
loopback address comment in the previous paragraph?


The sockets for L3 isolation are isolated like the L2 ones (this is common 
code). The difference is where the network namespace is found and used.
At layer 2, the namespace is found at the network device level; at 
layer 3, from the destination IP. So when you arrive at the 
socket level, you have the packet's destination namespace 
and you search for the sockets related to that specific namespace.
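
Purely as an illustration of that extended key (names invented): the
namespace pointer becomes one more field of the lookup tuple, so identical
address/port pairs in different namespaces never collide:

/* illustrative only; models the (saddr, sport, daddr, dport, ns) key */
#include <stdint.h>

struct net_ns;                         /* opaque here */

struct sock_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    const struct net_ns *ns;           /* the extra key: owning namespace */
};

static int sock_key_match(const struct sock_key *a, const struct sock_key *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport &&
           a->ns    == b->ns;          /* pointer identity discriminates VEs */
}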



  -- Daniel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network virtualization/isolation

2006-11-28 Thread Daniel Lezcano

Eric W. Biederman wrote:

[ snip ]



The packets arrive to the real device and go through the routes
engine. From this point, the used route is enough to know to which
container the traffic can go and the sockets subset assigned to the
container.


Note this has potentially the highest overhead of them all because
this is the only approach in which it is mandatory to inspect the
network packets to see which container they are in.


If the container is in the route information, when you use the route, 
you have the container destination with it. I don't see the overhead here.




My real problem with this approach besides seriously complicating
the administration by not delegating it is that you loose enormous
amounts of power.


I don't understand why you say administration is more complicated.
unshare -> ifconfig

1 container = 1 IP

[ snip ]


So you have two columns that you rate these things that I disagree
with, and you left out what the implications are for code maintenance.

1) Network setup.
   Past a certainly point both bind filtering and Daniel's L3 use a new
   paradigm for managing the network code and become nearly impossible for
   system administrators to understand.  The classic one is routing packets
   between machines over the loopback interface by accident. Huh?


What is this new paradigm you are talking about?



The L2. Network setup is simply the cost of setting up a multiple
machine network.  This is more complicated but it is well understood
and well documented today.  Plus for the common cases it is easy to
get a tool to automate this for you.  When you get a complicated
network this wins hands down because the existing tools work and
you don't have to retrain your sysadmins to understand what is
happening.


unshare -> (guest) add mac address
   (host) add mac address
   (guest) set ip address
   (host) set ip address
   (host) setup bridge

1 container = 2 net devices (root + guest), 2 IPs, 2 mac addresses, 1 
bridge.

100 containers = 200 net devices, 200 IPs, 200 mac addresses, 1 bridge.
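
Purely as an illustration of the 'unshare' step in the list above, a
minimal user-space sketch, assuming an unshare() flag for network
namespaces (CLONE_NEWNET is what later kernels provide); the mac/IP/bridge
steps stay with the usual tools:

/* minimal sketch: leave the current net namespace, then exec a command;
 * assumes a CLONE_NEWNET-style unshare flag for network namespaces */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }
    if (unshare(CLONE_NEWNET) < 0) {   /* new, empty network namespace */
        perror("unshare(CLONE_NEWNET)");
        return 1;
    }
    /* from here this process sees only its own (empty) device list;
     * the guest-side device, IP and the host-side bridge are set up
     * with the normal tools as listed above */
    execvp(argv[1], &argv[1]);
    perror("execvp");
    return 1;
}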



2) Runtime Overhead.

Your analysis is confused. Bind/Accept filter is much cheaper than
doing a per packet evaluation in the route cache of which container
it belongs to.  Among other things Bind/Accept filtering allows all
of the global variables in the network stack to remain global and
only touches a slow path.  So it is both very simple and very cheap.

Next in line comes L2 using real network devices, and Daniel's
L3 thing.  Because there are multiple instances of the networking data
structures we have an extra pointer indirection.


There is no extra networking data structure instantiation in 
Daniel's L3.


Finally we get L2 with an extra network stack traversal, because
we either need the full power of netfilter and traffic shaping
gating access to what a node is doing or we simply don't have
enough real network interfaces.  I assert that we can optimize
the lack of network interfaces away by optimizing the drivers
once this becomes an interesting case.

3) Long Term Code Maintenance Overhead.

- A pure L2 implementation.  There is a big one time cost of
  changing all of the variable accesses.  Once that transition
  is complete things just work.  All code is shared so there
  is no real overhead.

- Bind/Connect/Accept filtering.  There are so few places in
  the code this is easy to maintain without sharing code with
  everyone else.


For isolation too? Can we build network migration on top of that?



- Daniel's L3.  A big mass of special purpose code with peculiar
  semantics that no one else in the network stack cares about
  but is right in the middle of the code.


Thanks Eric for all your comments.

  -- Daniel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network virtualization/isolation

2006-11-28 Thread Eric W. Biederman

I do not want to get into a big debate on the merits of various
techniques at this time.  We seem to be in basic agreement
about what we are talking about.

There is one thing I think we can all agree upon.
- Everything except isolation at the network device/L2 layer does not
  allow guests to have the full power of the linux networking stack.

- There has been a demonstrated use for the full power of the linux
  networking stack in containers.

- There are a set of techniques which look as though they will give
  us full speed when we do isolation of the network stack at the
  network device/L2 layer.

Is there any reason why we don't want to implement network namespaces
without the full power of the linux network stack?

If there is a case where we clearly don't want the full power of the
linux network stack in a guest but we still need a namespace we can
start looking at the merits of the alternatives.

 What is this new paradigm you are talking about ?

The basic point is this.  The less like stock linux the inside of a
container looks, and the more of a special case it is the more
confusing it is.  The classic example is that for a system container
routing packets between containers over the loopback interface is
completely unexpected.


 There is not extra networking data structure instantiation in the
 Daniel's L3.
Nope just an extra field which serves the same purpose.

 - Bind/Connect/Accept filtering.  There are so few places in
   the code this is easy to maintain without sharing code with
   everyone else.

 For isolation too ? Can we build network migration on top of that ?

As long as you can take your globally visible network address with you
when you migrate you can build network migration on top of it.  So yes
bind/accept filtering is sufficient to implement migration, if you are
only using IP based protocols.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network virtualization/isolation

2006-11-28 Thread Herbert Poetzl
On Tue, Nov 28, 2006 at 09:51:57AM -0700, Eric W. Biederman wrote:
 
 I do not want to get into a big debate on the merits of various
 techniques at this time.  We seem to be in basic agreement
 about what we are talking about.
 
 There is one thing I think we can all agree upon.
 - Everything except isolation at the network device/L2 layer, does not
   allow guests to have the full power of the linux networking stack.
 
 - There has been a demonstrated use for the full power of the linux
   networking stack in containers..

  - There has been a demonstrated use for full-performance 
IP layer isolation too, both in BSD and Linux, for several
years now ...

 - There are a set of techniques which look as though they will give
   us full speed when we do isolation of the network stack at the
   network device/L2 layer.
 
 Is there any reason why we don't want to implement network namespaces
 without the full power of the linux network stack?

double negation -

  Is there any reason why we _want_ to implement network namespaces
   _with_ the full power of the linux network stack?

yes, I think you have some reasons for doing so, especially
the migration part seems to depend on it

OTOH, we _also_ want IP isolation, as it allows us to separate
services (and even handle overlapping sets) in a very natural
(linux) way, without adding interfaces and virtual switches
and bridges, at a potentially high overhead, just to do simple
layer 3 isolation

 If there is a case where we clearly don't want the full power of the
 linux network stack in a guest but we still need a namespace we can
 start looking at the merits of the alternatives.

see above, of course, all cases can be 'simulated' by a
full-blown layer 2 virtualization, so that's not an argument,
but OTOH, all this can also be achieved with Xen, so we
could just as well ask why have network namespaces
at all, if you can get the same functionality (including the
migration) with a Xen domU ...

  What is this new paradigm you are talking about ?
 
 The basic point is this.  The less like stock linux the inside of a
 container looks, and the more of a special case it is the more
 confusing it is.  The classic example is that for a system container
 routing packets between containers over the loopback interface is
 completely unexpected.

I disagree here, from the point of isolation that would be
the same as saying:

  having a chroot(), it is completely unexpected that
   the files reside on the same filesystem and even will
   be cached in the same inode cache

the thing is, once you depart from the 'container' = 'box'
idea, and accept that certain resources are shared (btw,
one of the major benefits of 'containers' over things like
Xen or UML) you can easily accept that:

 - host-local traffic uses loopback
 - non-local traffic uses the appropriate interfaces
 - guests _are_ local on the host, so 
 - guest-to-guest and guest-to-host traffic _is_ local
   and therefore will be more performant than remote
   traffic (unless you add various virtual switches and
   bridges and stacks to the paths)

  There is not extra networking data structure instantiation in the
  Daniel's L3.
 Nope just an extra field which serves the same purpose.
 
  - Bind/Connect/Accept filtering.  There are so few places in
the code this is easy to maintain without sharing code with
everyone else.
 
  For isolation too ? Can we build network migration on top of that ?
 
 As long as you can take your globally visible network address 
 with you when you migrate you can build network migration on 
 top of it.  So yes bind/accept filtering is sufficient to 
 implement migration, if you are only using IP based protocols.

correct, don't get me wrong, I'm absolutely not against
layer 2 virtualization, but not at the expense of light-
weight layer 3 isolation, which _is_ the traditional way
'containers' are built (see BSD, solaris ...)

HTC,
Herbert

 Eric
 ___
 Containers mailing list
 [EMAIL PROTECTED]
 https://lists.osdl.org/mailman/listinfo/containers
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network virtualization/isolation

2006-11-28 Thread Daniel Lezcano

Eric W. Biederman wrote:
 I do not want to get into a big debate on the merits of various
 techniques at this time.  We seem to be in basic agreement
 about what we are talking about.

 There is one thing I think we can all agree upon.
 - Everything except isolation at the network device/L2 layer, does not
   allow guests to have the full power of the linux networking stack.
Agree.

 - There has been a demonstrated use for the full power of the linux
   networking stack in containers..
Agree.

 - There are a set of techniques which look as though they will give
   us full speed when we do isolation of the network stack at the
   network device/L2 layer.
Agree.

 Is there any reason why we don't want to implement network namespaces
 without the full power of the linux network stack?
Don't get me wrong, I never said layer 2 should not be used. I am only
arguing that a layer 3 should use the mechanisms provided by the layer 2 and
use a subset of them, like the socket virtualization/isolation.

Just IP isolation for lightweight containers and application containers, in
order to have mobility.

 If there is a case where we clearly don't want the full power of the
 linux network stack in a guest but we still need a namespace we can
 start looking at the merits of the alternatives.
Dmitry and I are looking for an L3 based on a subset of the L2 and
in accordance with Herbert's needs.
If we can provide an L3 isolation based on the L2 which:
 - does not collide with L2
 - fits the needs of Herbert
 - allows migration
 - uses common code between L2 and L3
should it not be sufficient to justify having an L3 alongside the L2 isolation?

 What is this new paradigm you are talking about ?

 The basic point is this.  The less like stock linux the inside of a
 container looks, and the more of a special case it is the more
 confusing it is.  The classic example is that for a system container
 routing packets between containers over the loopback interface is
 completely unexpected.

Right for system containers, but not necessarily for application containers.


 There is not extra networking data structure instantiation in the
 Daniel's L3.
 Nope just an extra field which serves the same purpose.

 - Bind/Connect/Accept filtering.  There are so few places in
   the code this is easy to maintain without sharing code with
   everyone else.
 For isolation too ? Can we build network migration on top of that ?

 As long as you can take your globally visible network address with you
 when you migrate you can build network migration on top of it.  So yes
 bind/accept filtering is sufficient to implement migration, if you are
 only using IP based protocols.

When you migrate an application, you must clean up the related sockets on the
source machine. The cleanup cannot rely on the IP addresses because you
will not be able to identify all the sockets related to the
container. Another issue is the network objects' life-cycle: the
container will die when the application finishes, but the timewait
sockets will stay until all data are flushed to the peer. You cannot
immediately restart a new container with the same IP address, so you need to
monitor the sockets before relaunching a new container or removing the aliased
interface associated with the container. You need reference counting for the
container, and this refcount is exactly what the network namespace has.
Another example: you cannot have several applications binding to
INADDR_ANY:port without conflict. The multiport instantiation is exactly
what the socket isolation/virtualization with the l2/l3 brings.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network virtualization/isolation

2006-11-28 Thread Herbert Poetzl
On Tue, Nov 28, 2006 at 02:50:03PM -0700, Eric W. Biederman wrote:
 Daniel Lezcano [EMAIL PROTECTED] writes:
 
  Eric W. Biederman wrote:
  I do not want to get into a big debate on the merits of various
  techniques at this time.  We seem to be in basic agreement
  about what we are talking about.
 
  There is one thing I think we can all agree upon.
  - Everything except isolation at the network device/L2 layer, does not
allow guests to have the full power of the linux networking stack.
  Agree.
 
  - There has been a demonstrated use for the full power of the linux
networking stack in containers..
  Agree.
 
  - There are a set of techniques which look as though they will give
us full speed when we do isolation of the network stack at the
network device/L2 layer.
  Agree.
 
 Herbert Poetzl [EMAIL PROTECTED] writes:
  correct, don't get me wrong, I'm absolutely not against
  layer 2 virtualization, but not at the expense of light-
  weight layer 3 isolation, which _is_ the traditional way
  'containers' are built (see BSD, solaris ...)

 Ok. So on this point we agree. Full isolation at the network device/L2
 level is desirable and no one is opposed to that.

 There is however a strong feeling especially for the case of
 application containers that something more focused on what a
 non-privileged process can use and deal with would be nice. 
 The ``L3'' case.

 I agree that has potential but I worry about 2 things.
 - Premature optimization.
 - A poor choice of semantics.
 - Feature creep leading to insane semantics.

 I feel there is something in the L3 arguments as well and it sounds
 like it would be a good idea to flush out the semantics.

 For full network isolation we have the case that every process, 
 every socket, and every network device belongs to a network namespace.

 This is enough to derive the network namespace for all other user
 visible data structures, and to a large extent to define their
 semantics.

 We still need a definition of the non-privileged case, that is 
 compatible with the former definition.

yep, sounds interesting ...

 .
 
 What unprivileged user space gets to manipulate are sockets. 
 So perhaps we can break our model into a network socket namespace
 and network device namespace.

 I would define it so that for each socket there is exactly one 
 network socket namespace. And for each network socket namespace 
 there is exactly one network device namespace.

 The network socket namespace would be concerned with the rules for
 deciding which local addresses a socket can connect/accept/bind to.

 The network device namespace would be concerned with everything else.

hmm, guess I've read the word 'semantics' so many times
now, and always in conjunction with insane and unexpected,
so I think it can't hurt to explain the semantics behind
what we currently use once again, maybe I'm missing something

first, what we currently do:

 - a network context consists of a bunch of flags, and
   a set of ip addresses 

 - a process is either part of exactly one such context
   or unrestricted

 - at bind() time, collisions are checked (in the * case)
   and addresses are verified against the assigned set
   
 - at lookup() time, addresses are checked against the
   assigned set (again in the * case)

 - for queries, addresses are checked against the set, and
   if the address is found, the corresponding device will
   be visible (basically per address)

 - for guest-originating traffic, the src address will
   be picked from the set, where the first assigned IP
   is handled specially as a 'last resort' if no better one
   can be found

here now the semantics:

 - bind() can be done for all IP/port pairs which do not
   conflict with existing sockets 
   [identical to the current behaviour]

 - bind() to * is handled like a bind() for each address
   in the assigned set of IPs (if one fails, then the
   entire bind will fail)
   [identical behaviour, subset]

 - lookup() will only match sockets which match the address
   where * now means any IP from the IP set
   [identical behaviour, subset]

 - the source address has to reside within the IP set for
   outgoing traffic

 - netinfo/proc are filtered according to the rule
   address in set -> show address/interface

except for the last one, the behaviour is identical to the
current linux networking behaviour. the hiding of unavailable
interfaces/addresses is a virtualization mechanism we use to
make it look like a separate box, which is sometimes necessary
for certain applications and humans :)
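
a toy model of the bind()-time rule above (invented names, not actual
kernel code): a context owns a set of addresses; binding to one of them
is allowed, and binding to '*' is treated as binding to every address
in the set:

/* illustrative only -- models the bind()-time check, not actual kernel code */
#include <stdint.h>
#include <stddef.h>

struct net_context {
    const uint32_t *addrs;            /* assigned IPv4 set, host byte order */
    size_t          naddrs;
};

static int addr_in_set(const struct net_context *ctx, uint32_t a)
{
    for (size_t i = 0; i < ctx->naddrs; i++)
        if (ctx->addrs[i] == a)
            return 1;
    return 0;
}

/* returns 1 if this context may bind to addr (0 meaning INADDR_ANY, '*') */
static int ctx_bind_allowed(const struct net_context *ctx, uint32_t addr)
{
    if (!ctx)                         /* unrestricted process */
        return 1;
    if (addr == 0)                    /* '*' == the whole assigned set */
        return ctx->naddrs > 0;
    return addr_in_set(ctx, addr);    /* must be one of the assigned IPs */
}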

 The problem I see is the wild card binds.  In general unmodified
 server applications want to bind to *:port by default.  Running
 two such applications on different ip addresses is a problem.  Even
 if you can configure them not to do that, it becomes easy to do that
 by default.

those are working and running perfectly fine, the only
time you have to take care of such applications is when you
start them outside any isolation 

Re: Network virtualization/isolation

2006-11-28 Thread Herbert Poetzl
On Tue, Nov 28, 2006 at 09:26:52PM +0100, Daniel Lezcano wrote:
 Eric W. Biederman wrote:
   I do not want to get into a big debate on the merits of various
   techniques at this time.  We seem to be in basic agreement
   about what we are talking about.
  
   There is one thing I think we can all agree upon.
   - Everything except isolation at the network device/L2 layer, does not
 allow guests to have the full power of the linux networking stack.
 Agree.
  
   - There has been a demonstrated use for the full power of the linux
 networking stack in containers..
 Agree.
  
   - There are a set of techniques which look as though they will give
 us full speed when we do isolation of the network stack at the
 network device/L2 layer.
 Agree.
 
   Is there any reason why we don't want to implement network namespaces
   without the full power of the linux network stack?
 Don't make me wrong, I never said layer 2 should not be used. I am only
 arguing a layer 3 should use the mechanism provided by the layer 2 and
 use a subset of it like the sockets virtualization/isolation.
 
 Just IP isolation for lightweight containers, applications containers in
 order to have mobility.
 
   If there is a case where we clearly don't want the full power of the
   linux network stack in a guest but we still need a namespace we can
   start looking at the merits of the alternatives.
  Dmitry and I are looking at an l3 based on a subset of the l2, in
  accordance with Herbert's needs.
  If we can provide an l3 isolation based on the l2 which:
    - does not collide with l2
    - fits the needs of Herbert
    - allows migration
    - shares common code between l2 and l3
  should it not be sufficient to justify having an l3 alongside the l2
  isolation?

sounds good to me ...

   What is this new paradigm you are talking about ?
  
   The basic point is this.  The less like stock linux the inside of a
   container looks, and the more of a special case it is the more
   confusing it is.  The classic example is that for a system container
   routing packets between containers over the loopback interface is
   completely unexpected.
 
 Right for system containers, but not necessarily for application containers.

yep

best,
Herbert

   There is no extra networking data structure instantiation in
   Daniel's L3.
   Nope, just an extra field which serves the same purpose.
  
   - Bind/Connect/Accept filtering.  There are so few places in
 the code this is easy to maintain without sharing code with
 everyone else.
   For isolation too ? Can we build network migration on top of that ?
 
   As long as you can take your globally visible network address with you
   when you migrate you can build network migration on top of it.  So yes
   bind/accept filtering is sufficient to implement migration, if you are
   only using IP based protocols.
 
  When you migrate an application, you must clean up the related sockets on
  the source machine. The cleanup cannot rely on the IP addresses because
  you will not be able to discriminate all the sockets related to the
  container. Another issue is the network objects' life-cycle: the
  container will die when the application finishes, but the timewait
  sockets will stay until all data are flushed to the peer. You cannot
  restart a new container with the same IP address, so you need to monitor
  the sockets before relaunching a new container or unmounting the aliased
  interface associated with the container. You need a refcount for the
  container, and this refcount is exactly what the network namespace has.
  Another example: you cannot have several applications binding to
  INADDR_ANY:port without conflict. Per-container port-space instantiation
  is exactly what the socket isolation/virtualization with l2/l3 brings.
 


Re: Network virtualization/isolation

2006-11-26 Thread Herbert Poetzl
On Sat, Nov 25, 2006 at 01:21:39AM -0700, Eric W. Biederman wrote:
 
 jamal [EMAIL PROTECTED] writes:
 
  On Fri, 2006-27-10 at 11:10 +0200, Daniel Lezcano wrote:
 
  No, it uses virtualization at layer 2 and I had already mentioned it
  before (see the first email of the thread), but thank you for the email
  thread pointer.
 
 
  What would be really useful is if someone takes the time and creates a
  matrix of the differences between the implementations.
  It seems there are quite a few differences, but without such a comparison
  (to which all agree) it is hard to form an opinion without a document
  of some form.

  For one, I am puzzled by the arguments about L2 vs L3 - is this the
  host side or inside the VE?

  If it is a discussion of the host side:
  To me it seems it involves the classification of some packet header
  arriving on a physical netdevice on the host side (irrelevant whether
  they are L2 or L7) and reaching a decision to select some virtual
  netdevice to redirect to.
 
 There are two techniques in real use.
 - Bind/Accept filtering 
 
   Which layer 3 addresses a socket can bind/accept are filtered,
   but otherwise the network stack remains unchanged. When your
   container/VE only has a single IP address this works great. 

   When you get multiple IPs this technique starts to fall down because
   it is not obvious how to make this correctly handle wild card ip
   addresses. 

not really, you have to check for a (sub)set of IPs
that's quite simple and not hard to get right, I agree
that it increases the overhead on those checks, but 
this only occurs at bind/connect/accept time ...

   This technique also falls down because it is very hard to
   support raw IP packets.

raw ip packets could be header checked in the same way
but in general, this type of isolation does not make
too much sense for raw ip purposes ...

   The major advantage of this approach is that it is insanely simple
   and cheap.  

 and don't forget flexible, because it also allows a few
 things not quite easily done with layer2 techniques
 (see below)

   When the discussion started this is what I called Layer 3, bind
   filtering is probably a more precise name.
 
 - Network object tagging
 
   Every network device, each socket, each routing table, each net
   filter table, everything but the packets themselves is associated
   with a single VE/container.  In principle the network stack doesn't
   change except everything that currently access global variables gets
   an additional pointer indirection.
 
   To find where a packet belongs you must look at its network device on
   ingress, and you must look at its socket on egress.
 
   This allows capabilities like CAP_NET_ADMIN to be fairly safely
   given to people inside a container without problems.
 
   There are two basic concerns here. 
   1) This is a lot of code that needs to be touched.
   2) There are not enough physical network devices to go around so
  we need something that maps packets coming in a physical network
  device into multiple virtual network devices.
 
  The obvious way to do this mapping is with either ethernet
  bridging or with the linux routing code if the external network
  is not ethernet, and some tunnel devices between the VE and the
  host environment.  This allows firewalling and in general the
  full power of the linux network stack.
 
  The concern is that the extra trip through the network stack adds
  overhead.
 
   This is the technique we have been calling layer two because it
   works below the IP layer and as such should work for everything.
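
   For illustration only, here is a minimal sketch of that tagging idea
   with hypothetical type names (net_container, dev_get_by_name_in); it is
   not the namespace patchset itself, just the shape of "every object
   carries an owner pointer and lookups walk per-container lists":

/*
 * Hypothetical sketch of "network object tagging" (not the actual
 * namespace patchset): every network object carries a pointer to its
 * owning container, and lookups that used to walk global lists walk
 * per-container lists instead.
 */
#include <string.h>
#include <stddef.h>

struct net_container;

struct tagged_net_device {
    char name[16];
    struct net_container *owner;        /* which VE owns this device */
    struct tagged_net_device *next;     /* per-container device list */
};

struct tagged_sock {
    struct net_container *owner;        /* egress: container found via socket */
    /* ... protocol state ... */
};

struct net_container {
    struct tagged_net_device *dev_list; /* replaces the global device list */
    /* per-container routing tables, netfilter tables, ... */
};

/* What used to be a walk of the global device list becomes one extra
 * indirection through the container the caller runs in; devices of
 * other containers are simply invisible. */
static struct tagged_net_device *
dev_get_by_name_in(struct net_container *c, const char *name)
{
    for (struct tagged_net_device *d = c->dev_list; d; d = d->next)
        if (strcmp(d->name, name) == 0)
            return d;
    return NULL;
}

/* Ingress: the container comes from the receiving device.
 * Egress: the container comes from the sending socket. */
static struct net_container *rx_container(struct tagged_net_device *dev)
{
    return dev->owner;
}

static struct net_container *tx_container(struct tagged_sock *sk)
{
    return sk->owner;
}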
 
 There have been some intermediate solutions considered but generally
 they have the downsides of additional expense without the upsides of
 more power.
 
  The admin (on the host) decides what packets any VE can see. 
  Once within the VE, standard Linux net stack applies. The same applies
  on the egress. The admin decides what packets emanating from the VE
  go where.
  I don't think this is a simple L2 vs L3. You need to be able to process
  IP as well as Decnet [1].
 
 Except for some implementation details I don't think we have
 disagreement about what we are talking about although there is
 certainly a little confusion.  The true issue is can we implement
 something that places the full power of the network stack (including
 things like creating virtual tunnel devices) into the container/VE
 without sacrificing performance.
 
 I think we could all agree on the most general technique if we could
 convince ourselves the overhead was unmeasurable.

and we do not give away features like:

 - sharing an IP between different guests, so that
   services can bind to the same IP as long as they
   do not collide

 - allowing simple wrapping (a la chroot()) for processes
   without requiring a complete routing/network setup
   plus a virtual switch/router/whatever

 Given that performance is the primary concern this is something a
 network stack expert 

Re: Network virtualization/isolation

2006-11-26 Thread Ben Greear

Herbert Poetzl wrote:

On Sat, Nov 25, 2006 at 01:21:39AM -0700, Eric W. Biederman wrote:

Then the question is how do we reduce the overhead when we don't have
enough physical network interfaces to go around.  My feeling is that
we could push the work to the network adapters and allow single
physical network adapters to support multiple network interfaces, each
with a different link-layer address.

that would be something interesting, but again, the
number of nics allowing for an arbitrary number of
filters, which also can be identified/correlated to
the network context without adding even more overhead
is probably insignificant ... so IMHO that would:

 - keep all interfaces in promisc mode
 - check each packet for the set of MACs

as the checks would require to identify the interface,
that would immediately result in O(N) overhead for
each packet received, plus the overhead added by
disabling the hardware filters ... but maybe that
changed over the years, I'm definitely no network
stack/device expert ...
  
This can be implemented similarly to how MAC-VLANs are currently done (in
my out-of-tree patch).  There is a performance hit with lots of virtual
interfaces (maybe 10% in some cases), but this is still greater than
500Mbps full-duplex on 2 ports on a modern dual-core machine.

I don't even have hashing implemented, but it could be easily added and
that should significantly decrease the search time from O(n) to something
approaching O(1) in the normal case.
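
For illustration, a minimal sketch of the kind of hashing Ben mentions,
with hypothetical names (macvlan_port, hash_tbl) rather than his actual
out-of-tree patch: the destination MAC of an ingress frame is hashed to
find the virtual interface in roughly O(1):

/*
 * Hypothetical sketch of hashed MAC -> virtual interface demux
 * (not the actual mac-vlan patch): on ingress, hash the frame's
 * destination MAC and look it up in a small hash table instead of
 * scanning all virtual interfaces linearly.
 */
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define ETH_ALEN      6
#define MACVLAN_HASH  64                 /* must be a power of two */

struct macvlan_port {                    /* one virtual interface */
    uint8_t addr[ETH_ALEN];
    struct macvlan_port *hash_next;
    /* ... pointer to its net_device / container ... */
};

static struct macvlan_port *hash_tbl[MACVLAN_HASH];

static unsigned int mac_hash(const uint8_t *addr)
{
    /* the last byte is usually the most variable one for locally
     * administered addresses; good enough for a sketch */
    return addr[ETH_ALEN - 1] & (MACVLAN_HASH - 1);
}

static void macvlan_hash_add(struct macvlan_port *port)
{
    unsigned int h = mac_hash(port->addr);

    port->hash_next = hash_tbl[h];
    hash_tbl[h] = port;
}

/* Called for every ingress frame on the real device: O(1) on average,
 * O(chain length) in the worst case. */
static struct macvlan_port *macvlan_hash_lookup(const uint8_t *dest_mac)
{
    for (struct macvlan_port *p = hash_tbl[mac_hash(dest_mac)];
         p; p = p->hash_next)
        if (memcmp(p->addr, dest_mac, ETH_ALEN) == 0)
            return p;
    return NULL;    /* not ours: fall back to the real device's stack */
}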

This should also be an easy feature for NICs to add, and just as with
802.1Q VLANs, when hardware support is available, the features can
migrate into the NIC, with the software mac-vlan logic handling generic
hardware.

In a switched environment, going into PROMISC mode should not add any
significant overhead.

Ben

--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com





Re: Network virtualization/isolation

2006-11-26 Thread Eric W. Biederman
Herbert Poetzl [EMAIL PROTECTED] writes:

 On Sat, Nov 25, 2006 at 01:21:39AM -0700, Eric W. Biederman wrote:

 There are two techniques in real use.
 - Bind/Accept filtering 
 
   Which layer 3 addresses a socket can bind/accept are filtered,
   but otherwise the network stack remains unchanged. When your
   container/VE only has a single IP address this works great. 

   When you get multiple IPs this technique starts to fall down because
   it is not obvious how to make this correctly handle wild card ip
   addresses. 

 not really, you have to check for a (sub)set of IPs
 that's quite simple and not hard to get right, I agree
 that it increases the overhead on those checks, but 
 this only occurs at bind/connect/accept time ...

The general problem is you get into mental model problems.  You think
you are isolated but you don't realize you can route packets over the
loopback interface for example.  But with care yes you can solve it.

However while I think there is value in this technique it doesn't
solve any of my problems, nor do I think it can be easily stretched
to solve my problems.  My gut feel for implementation still says
this should be a new netfilter table that filters binds and accepts
if we implement this.

Most of us need more power than we can get with the simple
bind/accept filtering, so I think the network namespace work should
concentrate on the general technique that gives us the entire power of
the current network stack.  At least until we have proved the
overheads are unacceptable.

 Given that performance is the primary concern this is something a
 network stack expert might be able to help with.  My gut feel is
 the extra pointer indirection for the more general technique is
 negligible and will not affect the network performance.  The network
 stack can be very sensitive to additional cache misses so I could be
 wrong.  Opinions?

 well, here we are talking about layer2 _isolation_
 if I got that right, i.e. you split the physical
 interfaces up into separate network namespaces, which
 then can make full use of the assigned interfaces

Yes.  Layer 2 isolation is a good description.

 this is something which I'm perfectly fine with, as
 I do not think it adds significant overhead (nevertheless
 it needs some testing) 
Yes lots of testing and careful implementation.

 but at the same time, this is
 something which isn't very useful in the generic case,
 where folks will have, let's say two network interfaces
 and want to share one of them between 100 guests ...

It is useful in the generic case.  It just requires being smart to
keep the overheads down.

 as the checks would require to identify the interface,
 that would immediately result in O(N) overhead for
 each packet received, plus the overhead added by
 disabling the hardware filters ... but maybe that
 changed over the years, I'm definitely no network
 stack/device expert ...

Getting this to O(log(N)) is easy, and you can probably
get the average case to O(1) without trying too hard.  This
is no worse than routing tables or multiple IP addresses on
a single interface.  Ben Greear has addressed this.  His experience
suggests that even O(N) is not likely to be a significant problem.


Now I'm going to go bury my head in the sand for a bit.  The hard
problems are not how do we reshape the network stack but how do we
get the appropriate context into all of our user space interfaces.

Eric


RE: Network virtualization/isolation

2006-11-25 Thread Leonid Grossman
 

 -Original Message-
 From: Eric W. Biederman [mailto:[EMAIL PROTECTED] 
 Sent: Saturday, November 25, 2006 11:27 AM
 To: Leonid Grossman
 Cc: [EMAIL PROTECTED]; Daniel Lezcano; Dmitry Mishin; Stephen 
 Hemminger; netdev@vger.kernel.org; Linux Containers
 Subject: Re: Network virtualization/isolation
 
 Leonid Grossman [EMAIL PROTECTED] writes:
 
   
 
  -Original Message-
  From: [EMAIL PROTECTED]
  [mailto:[EMAIL PROTECTED] On Behalf Of Eric W. 
 Biederman
 
  Then the question is how do we reduce the overhead when we don't have
  enough physical network interfaces to go around.
  My feeling is that we could push the work to the network adapters and
  allow single physical network adapters to support multiple network
  interfaces, each with a different link-layer address.  At which point
  the overhead is nearly nothing and newer network adapters may start
  implementing enough filtering in hardware to do all of the work for us.
 
  Correct, to a degree.
  There will always be a limit on the number of physical channels that
  a NIC can support, while keeping these channels fully independent and
  protected at the hw level.
  So, you will probably still need to implement the sw path, with the
  assumption that some containers (that care about performance) will get
  a separate NIC interface and avoid the overhead, and other containers
  will have to use the sw path.
  There are some multi-channel NICs shipping today so it would be
  possible to see the overhead between the two options (I suspect it
  will be quite noticeable), but for a general idea about what work
  could be pushed down to network adapters in the near future you can
  look at the pcisig.com I/O Virtualization Workgroup.
  Once the single root I/O Virtualization spec is completed, it is
  likely to be supported by several NIC vendors to provide the multiple
  network interfaces on a single NIC that you are looking for.
 
 Pushing it all of the way into the hardware is an optimization that,
 while great, is likely not necessary.  Simply doing a table lookup by
 link-level address and selecting between several network interfaces is
 enough to ensure we only traverse the network stack once.
 
 To keep overhead down in the container case I don't need the hardware
 support to be so good that you can do kernel bypass and still trust
 that everything is safe.  I simply need a fast link-level address to
 container mapping.  We already look at the link-level address on every
 packet received so that should not generate any extra cache misses.

I did not mean kernel bypass, just L2 hw channels that for 
all practical purposes act as separate NICs - 
different MAC addresses, no blocking, independent reset, etc.

 
 In the worst case I might need someone to go as far as the 
 Grand Unified Lookup to remove all of the overheads.  Except 
 for distributing the work load more evenly across the machine 
 with separate interrupts and the like I see no need for 
 separate hardware channels to make things go fast for my needs.
 
 Despite the title of this thread there is no virtualization 
 or emulation of the hardware involved.  Just enhancements to 
 the existing hardware abstractions.

Right, I was just trying to say that IOV support (likely from multiple
vendors, since virtualization is expected to be widely used) would
provide an option to export multiple independent L2 interfaces from a
single NIC - even if only a subset of IOV functionality would be used
in this case.

 
 Eric
 



Re: Network virtualization/isolation

2006-11-25 Thread Eric W. Biederman
Leonid Grossman [EMAIL PROTECTED] writes:

 I did not mean kernel bypass, just L2 hw channels that for 
 all practical purposes act as separate NICs - 
 different MAC addresses, no blocking, independent reset, etc.

Yes. Nearly all of what you need for safe kernel bypass.

 In the worst case I might need someone to go as far as the 
 Grand Unified Lookup to remove all of the overheads.  Except 
 for distributing the work load more evenly across the machine 
 with separate interrupts and the like I see no need for 
 separate hardware channels to make things go fast for my needs.
 
 Despite the title of this thread there is no virtualization 
 or emulation of the hardware involved.  Just enhancements to 
 the existing hardware abstractions.

 Right, I was just trying to say that IOV support (likely from multiple
 vendors, since virtualization is expected to be widely used) would
 provide an option to export multiple independent L2 interfaces from a
 single NIC - even if only a subset of IOV functionality would be used
 in this case.

Agreed, and I think I understood that.  My basic point was that it
doesn't look to me like I need the hardware support, just that I can
use it when it is there.

The core advantage I see of the multiple queues, is in being able to
split the processing of network traffic and interrupts among multiple
cores.

Eric


Re: Network virtualization/isolation

2006-11-14 Thread Daniel Lezcano

Then a matrix of how each requires what modifications in the network
code. Of course all players need to agree that the description is
accurate.
Is there such a document?

cheers,
jamal


Hi,

the attached document describes the network isolation at layer 2 and at
layer 3; it presents the pros and cons of the different approaches,
their common points and the impacted network code.

I hope it will be helpful :)

Cheers.

   -- Daniel




Isolating and virtualizing the network
--

Some definitions:
-

isolation: a restrictive technique which divides the set of available
system objects into smaller subsets assigned to a group of processes.
This technique ensures an application will use only a subset of the
system resources and will never access other resources.

virtualization: a technique which gives an application the illusion
that it owns all the system resources instead of the subset of them
provided by the isolation.

container: the base element which brings the isolation and the
virtualization, and inside which applications are running.

system container: an operating system running inside a container.

application container: an application running inside a container.

checkpoint/restart: taking a snapshot of a container at a given time
and recreating the container from this snapshot.

mobility: checkpoint/restart used to move a container from one host to
another host.



Currently, containers are being developed in the kernel with the
following functions:

  * separate the system resources between containers in order
    to prevent an application running inside a container from
    accessing resources outside the container. That
    facilitates resource management, ensures the application
    is jailed and increases security.

  * virtualize the resources; that avoids resource conflicts
    between containers and allows running several instances of
    the same server without modifying its network
    configuration.

  * the combination of the isolation and the virtualization is
    the base for checkpoint/restart. The checkpoint is easier
    because the resources are identified per container, and the
    restart is possible because the applications can be
    recreated with the same resource identifiers without
    conflicts. For example, an application has pid 1000; it is
    checkpointed, and when it is restarted the same pid is
    assigned to it and will not conflict, because pids are
    isolated and virtualized.

Among all the system resources, the network is one of the biggest parts
to isolate and virtualize. Some solutions were proposed, with
different approaches and different implementations.

Layer 2 isolation and virtualization


The virtualization acts at the network device level. The routes and
the sockets are isolated. Each container has its own network device
and its own routes. The network must be configured in each container.

This approach brings a very strong isolation and a perfect
virtualization for the system containers.


 - Ingress traffic

The packets arrive at the real network device, outside of the
container. Depending on the destination, the packets are forwarded to
the network device assigned to the container. From this point on, the
path is the usual one: the packets go through the routing and socket
layers, which are isolated inside the container.


 - Outgoing traffic

The packets go through the socket layer, the routes and the network
device assigned to the container, and finally reach the real device.


Implementation:
---

The patchset for this approach, from Andrey Savochkin of the OpenVZ
team, uses the namespace concept.  The network devices are no longer
stored in the dev_base_list but in a list kept in the network namespace
structure. Each container has its own network namespace. The network
device accesses have been changed to use the network device list of the
current namespace's context instead of the global network device list.
The same has been done for the routing tables: they are all relative to
the namespace and are no longer global statics. The creation of a new
network namespace implies the creation of a new set of routing tables.

After the creation of a container, no network device exists. One is
created from outside by the container's parent. The communication
between the new container and the outside is done via a special pair
device which has one extremity in each namespace. The MAC addresses
must be specified, and these addresses should be handled by the
container developers in order to ensure MAC uniqueness.
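
A minimal sketch of the pair-device idea, with hypothetical names
(pair_device, netif_rx_in) rather than the actual patchset: transmitting
on one end simply delivers the packet to the peer end in the other
namespace:

/*
 * Hypothetical sketch of the pair device (not the actual patchset):
 * two ends, one in each namespace; transmitting on one end delivers
 * the packet as received traffic on the peer end.
 */
#include <stddef.h>

struct net_ns;                          /* the network namespace */

struct packet {
    unsigned char data[1514];
    size_t len;
};

struct pair_device {
    struct net_ns *ns;                  /* namespace this end lives in */
    unsigned char mac[6];               /* must be kept unique by the admin */
    struct pair_device *peer;           /* the other end of the pair */
};

/* Stand-in for the namespace's normal receive path. */
static void netif_rx_in(struct net_ns *ns, struct packet *pkt)
{
    (void)ns;
    (void)pkt;
    /* ... run the packet through the routes/sockets of that namespace ... */
}

/* Transmit on one end: the packet shows up in the other namespace. */
static void pair_xmit(struct pair_device *dev, struct packet *pkt)
{
    netif_rx_in(dev->peer->ns, pkt);
}

/* The container's parent creates both ends and wires them together. */
static void pair_create(struct pair_device *host_end, struct net_ns *host_ns,
                        struct pair_device *guest_end, struct net_ns *guest_ns)
{
    host_end->ns = host_ns;
    host_end->peer = guest_end;
    guest_end->ns = guest_ns;
    guest_end->peer = host_end;
}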

After this network device creation step into each namespace, the
network configuration is done as usual, in other words, with a new

Re: Network virtualization/isolation

2006-11-14 Thread James Morris
On Tue, 14 Nov 2006, Daniel Lezcano wrote:

 the attached document describes the network isolation at layer 2 and at
 layer 3; it presents the pros and cons of the different approaches, their
 common points and the impacted network code.
 I hope it will be helpful :)

What about other network subsystems: xfrm, netfilter, iptables, netlink, 
etc. ?



- James
-- 
James Morris
[EMAIL PROTECTED]


Re: Network virtualization/isolation

2006-11-01 Thread jamal
On Fri, 2006-27-10 at 11:10 +0200, Daniel Lezcano wrote:

 No, it uses virtualization at layer 2 and I had already mentioned it
 before (see the first email of the thread), but thank you for the email
 thread pointer.


What would be really useful is if someone takes the time and creates a
matrix of the differences between the implementations.
It seems there are quite a few differences, but without such a comparison
(to which all agree) it is hard to form an opinion without a document
of some form.

For one, I am puzzled by the arguments about L2 vs L3 - is this the
host side or inside the VE?

If it is a discussion of the host side:
To me it seems it involves the classification of some packet header
arriving on a physical netdevice on the host side (irrelevant whether
they are L2 or L7) and reaching a decision to select some virtual
netdevice to redirect to.
The admin (on the host) decides what packets any VE can see.
Once within the VE, the standard Linux net stack applies. The same applies
on egress. The admin decides what packets emanating from the VE
go where.
I don't think this is a simple L2 vs L3. You need to be able to process
IP as well as Decnet [1].

If it is a discussion on the VE side, then it needs to start at the
netdevice for simplicity IMO.

Anyways without background material on the different religions it is
hard to reach a compromise - so it would be useful for someone to write
a simple (not more than 4 pages total) background on the different
approaches. For example, you could address the following:
- how each treats packets coming in at the host level
- what happens on incoming to the VE
- what happens on egress of VE
- how each treats packets departing from the host
- how each treats inter-VE communication

Then a matrix of how each requires what modifications in the network
code. Of course all players need to agree that the description is
accurate.
Is there such a document?

cheers,
jamal

[1] Since Linux has the only SMP-capable, firewall-capable Decnet
implementation - wouldn't it be fun to have it be virtualized as
well? ;-



Re: Network virtualization/isolation

2006-11-01 Thread Daniel Lezcano

What would be really useful is if someone takes the time and creates a
matrix of the differences between the implementations.
It seems there are quite a few differences, but without such a comparison
(to which all agree) it is hard to form an opinion without a document
of some form.


If Dmitry agrees, we can work on the document.
Dmitry, is it OK for you?

  -- Daniel





Re: Network virtualization/isolation

2006-10-27 Thread Dmitry Mishin
On Thursday 26 October 2006 19:56, Stephen Hemminger wrote:
 On Thu, 26 Oct 2006 11:44:55 +0200

 Daniel Lezcano [EMAIL PROTECTED] wrote:
  Stephen Hemminger wrote:
   On Wed, 25 Oct 2006 17:51:28 +0200
  
   Daniel Lezcano [EMAIL PROTECTED] wrote:
  Hi Stephen,
  
  currently the work to enable containers in the kernel is making good
  progress. The ipc, pid, utsname and filesystem system resources are
  isolated/virtualized relying on the namespaces concept.
  
  But the network virtualization/isolation is still missing. Two
  approaches are proposed: doing the isolation at layer 2 and at
  layer 3.
  
  The first one instantiates a network device per namespace and adds a
  peer network device into the root namespace; all the routing resources
  are relative to the namespace. This work is done by Andrey Savochkin
  from the openvz project.
  
  The second relies on the routes and associates the network namespace
  pointer with each route. When the traffic is incoming, the packet
  follows an input route and retrieves the associated network namespace.
  When the traffic is outgoing, the packet, identified by the network
  namespace it is coming from, follows only the routes matching that
  network namespace. This work is done by me.
  
  IMHO, we need the two approaches: layer 2 to be able to bring *very*
  strong isolation for system containers with a performance cost, and
  layer 3 to be able to have good isolation for lightweight containers or
  application containers when performance is more important.
  
  Do you have some suggestions ? What is your point of view on that ?
  
  Thanks in advance.
  
 -- Daniel
  
   Any solution should allow both and it should build on the existing
   netfilter infrastructure.
 
  The problem is netfilter cannot give a good isolation, e.g. how can the
  netstat command be handled? Or how do we avoid seeing IP addresses
  assigned to another container when doing ifconfig? Furthermore, one of
  the biggest interests of the network isolation is to bring mobility
  with a container, and that can only be done if the network resources
  inside the kernel can be identified by container in order to
  checkpoint/restart them.
 
  The all-in-namespace solution, i.e. at layer 2, is very good in terms of
  isolation but it adds a non-negligible overhead. The layer 3 isolation
  has an insignificant overhead and a good isolation, perfectly adapted
  for application containers.
 
  Unfortunately, from the point of view of implementation, layer 3 cannot
  be a subset of the layer 2 isolation when using all-in-namespace, and
  layer 2 isolation cannot be an extension of the layer 3 isolation.
 
  I think the layer 2 and the layer 3 implementations can coexist. You
  can for example create a system container with a layer 2 isolation and
  inside it add a layer 3 isolation.
 
  Does that make sense?
 
  -- Daniel

 Assuming you are talking about pseudo-virtualized environments,
 there are several different discussions.

 1. How should the namespace be isolated for the virtualized containered
applications?

 2. How should traffic be restricted into/out of those containers. This
is where existing netfilter, classification, etc, should be used.
The network code is overly rich as it is, we don't need another
abstraction.

 3. Can the virtualized containers be secure? No, we really can't keep
    hostile root in a container from killing the system without going to
    a hypervisor.
Stephen, 

A virtualized container can be secure if it is a complete system
virtualization, not just an application container. OpenVZ implements such
containers and they are used heavily around the world. And of course, we
care a lot about keeping a hostile root from killing the whole system.
 
OpenVZ uses virtualization at the IP level (implemented by Andrey Savochkin, 
http://marc.theaimsgroup.com/?l=linux-netdevm=115572448503723), with all
necessary network objects isolated/virtualized, such as sockets, devices, 
routes, netfilters, etc.

-- 
Thanks,
Dmitry.


Re: Network virtualization/isolation

2006-10-27 Thread Daniel Lezcano


[ ... ]

Dmitry Mishin wrote:
Stephen, 

A virtualized container can be secure if it is a complete system
virtualization, not just an application container. OpenVZ implements such
containers and they are used heavily around the world. And of course, we
care a lot about keeping a hostile root from killing the whole system.


OpenVZ power !!

OpenVZ uses virtualization at the IP level (implemented by Andrey Savochkin, 
http://marc.theaimsgroup.com/?l=linux-netdevm=115572448503723), with all
necessary network objects isolated/virtualized, such as sockets, devices, 
routes, netfilters, etc.


No, it uses virtualization at layer 2 and I had already mentioned it
before (see the first email of the thread), but thank you for the email
thread pointer.


The discussion is not to convince Stephen that layer 2 or layer 3 is the
best, but to present the pros and the cons of each solution and to get
the point of view of a network guru.


Regards.

-- Daniel






Re: Network virtualization/isolation

2006-10-26 Thread Daniel Lezcano

Stephen Hemminger wrote:

On Wed, 25 Oct 2006 17:51:28 +0200
Daniel Lezcano [EMAIL PROTECTED] wrote:



Hi Stephen,

currently the work to enable containers in the kernel is making good
progress. The ipc, pid, utsname and filesystem system resources are
isolated/virtualized relying on the namespaces concept.

But the network virtualization/isolation is still missing. Two
approaches are proposed: doing the isolation at layer 2 and at layer 3.

The first one instantiates a network device per namespace and adds a peer
network device into the root namespace; all the routing resources are
relative to the namespace. This work is done by Andrey Savochkin from
the openvz project.

The second relies on the routes and associates the network namespace
pointer with each route. When the traffic is incoming, the packet
follows an input route and retrieves the associated network namespace.
When the traffic is outgoing, the packet, identified by the network
namespace it is coming from, follows only the routes matching that
network namespace. This work is done by me.

IMHO, we need the two approaches: layer 2 to be able to bring *very*
strong isolation for system containers with a performance cost, and
layer 3 to be able to have good isolation for lightweight containers or
application containers when performance is more important.


Do you have some suggestions ? What is your point of view on that ?

Thanks in advance.

  -- Daniel



Any solution should allow both and it should build on the existing netfilter 
infrastructure.




The problem is netfilter cannot give a good isolation, e.g. how can the
netstat command be handled? Or how do we avoid seeing IP addresses
assigned to another container when doing ifconfig? Furthermore, one of
the biggest interests of the network isolation is to bring mobility with
a container, and that can only be done if the network resources inside
the kernel can be identified by container in order to checkpoint/restart
them.

The all-in-namespace solution, i.e. at layer 2, is very good in terms of
isolation but it adds a non-negligible overhead. The layer 3 isolation
has an insignificant overhead and a good isolation, perfectly adapted
for application containers.

Unfortunately, from the point of view of implementation, layer 3 cannot
be a subset of the layer 2 isolation when using all-in-namespace, and
layer 2 isolation cannot be an extension of the layer 3 isolation.

I think the layer 2 and the layer 3 implementations can coexist. You
can for example create a system container with a layer 2 isolation and
inside it add a layer 3 isolation.

Does that make sense?

-- Daniel






Re: Network virtualization/isolation

2006-10-26 Thread Stephen Hemminger
On Thu, 26 Oct 2006 11:44:55 +0200
Daniel Lezcano [EMAIL PROTECTED] wrote:

 Stephen Hemminger wrote:
  On Wed, 25 Oct 2006 17:51:28 +0200
  Daniel Lezcano [EMAIL PROTECTED] wrote:
  
  
 Hi Stephen,
 
 currently the work to enable containers in the kernel is making good
 progress. The ipc, pid, utsname and filesystem system resources are
 isolated/virtualized relying on the namespaces concept.
 
 But the network virtualization/isolation is still missing. Two
 approaches are proposed: doing the isolation at layer 2 and at layer 3.
 
 The first one instantiates a network device per namespace and adds a peer
 network device into the root namespace; all the routing resources are
 relative to the namespace. This work is done by Andrey Savochkin from
 the openvz project.
 
 The second relies on the routes and associates the network namespace
 pointer with each route. When the traffic is incoming, the packet
 follows an input route and retrieves the associated network namespace.
 When the traffic is outgoing, the packet, identified by the network
 namespace it is coming from, follows only the routes matching that
 network namespace. This work is done by me.
 
 IMHO, we need the two approaches: layer 2 to be able to bring *very*
 strong isolation for system containers with a performance cost, and
 layer 3 to be able to have good isolation for lightweight containers or
 application containers when performance is more important.
 
 Do you have some suggestions ? What is your point of view on that ?
 
 Thanks in advance.
 
-- Daniel
  
  
  Any solution should allow both and it should build on the existing 
  netfilter infrastructure.
  
  
 
 The problem is netfilter cannot give a good isolation, e.g. how can the
 netstat command be handled? Or how do we avoid seeing IP addresses
 assigned to another container when doing ifconfig? Furthermore, one of
 the biggest interests of the network isolation is to bring mobility with
 a container, and that can only be done if the network resources inside
 the kernel can be identified by container in order to checkpoint/restart
 them.
 
 The all-in-namespace solution, i.e. at layer 2, is very good in terms of
 isolation but it adds a non-negligible overhead. The layer 3 isolation
 has an insignificant overhead and a good isolation, perfectly adapted
 for application containers.
 
 Unfortunately, from the point of view of implementation, layer 3 cannot
 be a subset of the layer 2 isolation when using all-in-namespace, and
 layer 2 isolation cannot be an extension of the layer 3 isolation.
 
 I think the layer 2 and the layer 3 implementations can coexist. You
 can for example create a system container with a layer 2 isolation and
 inside it add a layer 3 isolation.
 
 Does that make sense?
 
   -- Daniel

Assuming you are talking about pseudo-virtualized environments,
there are several different discussions.

1. How should the namespace be isolated for the virtualized containered
   applications?

2. How should traffic be restricted into/out of those containers? This
   is where existing netfilter, classification, etc, should be used.
   The network code is overly rich as it is, we don't need another
   abstraction.

3. Can the virtualized containers be secure? No, we really can't keep
   hostile root in a container from killing the system without going to
   a hypervisor.





-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: Network virtualization/isolation

2006-10-26 Thread Daniel Lezcano

Stephen Hemminger wrote:

On Thu, 26 Oct 2006 11:44:55 +0200
Daniel Lezcano [EMAIL PROTECTED] wrote:


[ ... ]


Assuming you are talking about pseudo-virtualized environments,
there are several different discussions.


Yes, exact, I forgot to mention that.



1. How should the namespace be isolated for the virtualized containered
   applications?


The network resources should be related to the namespaces, and
especially the struct sock. So when a checkpoint is initiated for the
container, you can identify the established connections, the timewait
sockets, the request queues, ... related to the container in order to
freeze the traffic and checkpoint them.
The IP addresses are not a valid discriminator for identifying them, for
example if you have several containers interconnected on the same host.
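
For illustration only, a sketch with hypothetical names (sock_ckpt,
checkpoint_container_sockets) of why an owner pointer on each socket makes
the checkpoint enumeration trivial, where filtering on IP addresses alone
would not:

/*
 * Hypothetical sketch (not the actual checkpoint code): if each socket
 * carries a pointer to its owning container, a checkpointer can walk
 * the socket list and pick out exactly the sockets that belong to the
 * container being frozen, which matching on IP addresses alone cannot
 * do reliably.
 */
#include <stddef.h>

struct container;

enum sock_state { SS_ESTABLISHED, SS_TIMEWAIT, SS_LISTEN /* ... */ };

struct sock_ckpt {
    struct container *owner;    /* tag added for container support */
    enum sock_state state;
    struct sock_ckpt *next;     /* global socket list */
};

typedef void (*ckpt_fn)(struct sock_ckpt *sk);

/* Walk all sockets and checkpoint only those owned by 'cont';
 * timewait sockets are included so the restart side can wait for them. */
static void checkpoint_container_sockets(struct sock_ckpt *all_socks,
                                         struct container *cont,
                                         ckpt_fn save)
{
    for (struct sock_ckpt *sk = all_socks; sk; sk = sk->next)
        if (sk->owner == cont)
            save(sk);
}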




2. How should traffic be restricted into/out of those containers. This
   is where existing netfilter, classification, etc, should be used.
   The network code is overly rich as it is, we don't need another
   abstraction.


Using only netfilter you will not be able to bind to the same
INADDR_ANY,port in different containers. You will need to handle several
IP addresses coming from IP aliasing and check the source address to be
sure it is related to the right container and not to a primary interface
probably assigned to a different container.


3. Can the virtualized containers be secure? No, we really can't keep
   hostile root in a container from killing the system without going to
   a hypervisor.


That is totally true; the containers don't aim to replace
fully-virtualized environments.




Network virtualization/isolation

2006-10-25 Thread Daniel Lezcano

Hi Stephen,

currently the work to enable containers in the kernel is making good
progress. The ipc, pid, utsname and filesystem system resources are
isolated/virtualized relying on the namespaces concept.

But the network virtualization/isolation is still missing. Two
approaches are proposed: doing the isolation at layer 2 and at layer 3.

The first one instantiates a network device per namespace and adds a peer
network device into the root namespace; all the routing resources are
relative to the namespace. This work is done by Andrey Savochkin from
the openvz project.

The second relies on the routes and associates the network namespace
pointer with each route. When the traffic is incoming, the packet
follows an input route and retrieves the associated network namespace.
When the traffic is outgoing, the packet, identified by the network
namespace it is coming from, follows only the routes matching that
network namespace. This work is done by me.
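
For illustration, a minimal sketch of this route-based layer 3 approach
with hypothetical structure names (route_entry, net_ns); it is not the
actual patch, just the idea that each route carries a namespace pointer
which is consulted on ingress and egress:

/*
 * Hypothetical sketch of the layer 3 approach described above (not the
 * actual patch): each route carries a network namespace pointer.  On
 * ingress the namespace is taken from the matched input route; on
 * egress only routes belonging to the sender's namespace are eligible.
 */
#include <stdint.h>
#include <stddef.h>

struct net_ns;                  /* the network namespace */

struct route_entry {
    uint32_t dst;               /* destination prefix (simplified) */
    uint32_t mask;
    struct net_ns *ns;          /* namespace this route belongs to */
    struct route_entry *next;
};

/* Egress: walk the routes but skip entries of other namespaces. */
static struct route_entry *route_output(struct route_entry *table,
                                        struct net_ns *sender_ns,
                                        uint32_t daddr)
{
    for (struct route_entry *r = table; r; r = r->next)
        if (r->ns == sender_ns && (daddr & r->mask) == r->dst)
            return r;
    return NULL;
}

/* Ingress: the first matching input route tells us which namespace
 * (and therefore which container) the packet is destined for. */
static struct net_ns *route_input_ns(struct route_entry *table,
                                     uint32_t daddr)
{
    for (struct route_entry *r = table; r; r = r->next)
        if ((daddr & r->mask) == r->dst)
            return r->ns;
    return NULL;                /* no container claims this address */
}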


IMHO, we need the two approaches: layer 2 to be able to bring *very*
strong isolation for system containers with a performance cost, and
layer 3 to be able to have good isolation for lightweight containers or
application containers when performance is more important.


Do you have some suggestions ? What is your point of view on that ?

Thanks in advance.

  -- Daniel


Re: Network virtualization/isolation

2006-10-25 Thread Stephen Hemminger
On Wed, 25 Oct 2006 17:51:28 +0200
Daniel Lezcano [EMAIL PROTECTED] wrote:

 Hi Stephen,
 
 currently the work to enable containers in the kernel is making good
 progress. The ipc, pid, utsname and filesystem system resources are
 isolated/virtualized relying on the namespaces concept.
 
 But the network virtualization/isolation is still missing. Two
 approaches are proposed: doing the isolation at layer 2 and at layer 3.
 
 The first one instantiates a network device per namespace and adds a peer
 network device into the root namespace; all the routing resources are
 relative to the namespace. This work is done by Andrey Savochkin from
 the openvz project.
 
 The second relies on the routes and associates the network namespace
 pointer with each route. When the traffic is incoming, the packet
 follows an input route and retrieves the associated network namespace.
 When the traffic is outgoing, the packet, identified by the network
 namespace it is coming from, follows only the routes matching that
 network namespace. This work is done by me.
 
 IMHO, we need the two approaches: layer 2 to be able to bring *very*
 strong isolation for system containers with a performance cost, and
 layer 3 to be able to have good isolation for lightweight containers or
 application containers when performance is more important.
 
 Do you have some suggestions ? What is your point of view on that ?
 
 Thanks in advance.
 
-- Daniel

Any solution should allow both and it should build on the existing netfilter 
infrastructure.


-- 
Stephen Hemminger [EMAIL PROTECTED]