Re: [PATCH 2.6.25] netns: struct net content re-work

2007-12-10 Thread Kirill Korotaev
Daniel Lezcano wrote:
 Denis V. Lunev wrote:
 
Recently David Miller and Herbert Xu pointed out that struct net is becoming
bloated and unmaintainable. There are two solutions:
- provide a pointer to a network subsystem definition from struct net.
  This costs an additional dereference
- place the sub-system definition into the structure itself. This will speed up
  run-time access at the cost of recompilation time

The second approach looks better for us.
 
 
 Yes, we do not need/want a pointer in this structure, adding more 
 dereferences in the network code.
 
 
Other sub-systems will be converted
to this approach if this will be accepted :)

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index b62e31f..f60e1ce 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -8,6 +8,8 @@
 #include <linux/workqueue.h>
 #include <linux/list.h>

+#include <net/netns/unix.h>
+
 struct proc_dir_entry;
 struct net_device;
 struct sock;
@@ -46,8 +48,7 @@ struct net {
  struct hlist_head   packet_sklist;

  /* unix sockets */
- int sysctl_unix_max_dgram_qlen;
- struct ctl_table_header *unix_ctl;
+ struct netns_unix   unx;
 
 
 Can you change this from unx to unix ?

no, it won't compile. Guess why :)
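
Easy to verify outside the thread: gcc's default gnu89 dialect predefines
the identifier "unix" as a macro equal to 1 on Unix targets, so a member of
that name would expand into a bare constant; the predefine disappears under
-std=c99. A minimal standalone check:

#include <stdio.h>

int main(void)
{
#ifdef unix
	printf("\"unix\" is predefined to %d here\n", unix);
#else
	printf("\"unix\" is not predefined (strict ISO mode)\n");
#endif
	return 0;
}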

 If you encapsulate the structure definitions per subsystem, you can drop 
 the unix prefix in the variable declaration.
 
 Instead of having:
   netns->unix->unix_ctl
 you will have:
   netns->unix->ctl

agree.

Kirill
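
To make the trade-off concrete, a minimal sketch - an assumption based on
the two fields moved out of struct net above, not the literal header from
the patch - of what net/netns/unix.h would carry, next to the two layouts
being compared:

/* Sketch of include/net/netns/unix.h (illustrative). */
struct ctl_table_header;

struct netns_unix {
	int			sysctl_max_dgram_qlen;
	struct ctl_table_header	*ctl;
};

/* Option 1 (rejected above): a pointer costs one extra dereference
 * on every access. */
struct net_v1 { struct netns_unix *unx; };	/* net->unx->ctl */

/* Option 2 (chosen): embedded; direct access, but any change to
 * struct netns_unix recompiles everything including net_namespace.h. */
struct net_v2 { struct netns_unix unx; };	/* net->unx.ctl */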


Re: [PATCH][net-2.6.25][NETNS] make netns cleanup to run in a separate workqueue

2007-11-19 Thread Kirill Korotaev
Ah, sorry. Didn't notice it's called only on boot.

Acked-By: Kirill Korotaev [EMAIL PROTECTED]

Kirill Korotaev wrote:
 imho panic() is too much.
 create_singlethread_workqueue() can fail e.g. due to out of memory...
 
 Thanks,
 Kirill
 
 
 Daniel Lezcano wrote:
 
Subject: make netns cleanup run in a separate workqueue
From: Benjamin Thery [EMAIL PROTECTED]

This patch adds a separate workqueue for cleaning up a network
namespace. If we use the keventd workqueue to execute cleanup_net(),
there is a problem unregistering devices in IPv6. Indeed, the cleanup
code also schedules work on keventd: as long as cleanup_net() hasn't
returned, dst_gc_task() cannot run, and as long as dst_gc_task() has
not run, there are still references pending on the net devices and
cleanup_net() cannot unregister them and exit the keventd workqueue.

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
Signed-off-by: Benjamin Thery [EMAIL PROTECTED]
Acked-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/core/net_namespace.c |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

Index: net-2.6.25/net/core/net_namespace.c
===================================================================
--- net-2.6.25.orig/net/core/net_namespace.c
+++ net-2.6.25/net/core/net_namespace.c
@@ -58,6 +58,7 @@ out_undo:
 
 #ifdef CONFIG_NET_NS
 static struct kmem_cache *net_cachep;
+static struct workqueue_struct *netns_wq;
 
 static struct net *net_alloc(void)
 {
@@ -149,7 +150,7 @@ void __put_net(struct net *net)
 {
  /* Cleanup the network namespace in process context */
  INIT_WORK(&net->work, cleanup_net);
- schedule_work(&net->work);
+ queue_work(netns_wq, &net->work);
 }
 EXPORT_SYMBOL_GPL(__put_net);
 
@@ -171,7 +172,13 @@ static int __init net_ns_init(void)
  net_cachep = kmem_cache_create("net_namespace", sizeof(struct net),
  SMP_CACHE_BYTES,
  SLAB_PANIC, NULL);
+
+ /* Create workqueue for cleanup */
+ netns_wq = create_singlethread_workqueue("netns");
+ if (!netns_wq)
+ panic("Could not create netns workq");
 
 
 #endif
+
  mutex_lock(&net_mutex);
  err = setup_net(&init_net);
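
A userspace analogy of the deadlock the changelog describes (all names
hypothetical, not kernel code): one worker thread serves a queue, job A
blocks waiting for job B, and B sits in the same queue behind A. Running B
from a second context - the equivalent of the dedicated netns workqueue
above - unblocks A:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int b_done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static void job_b(void)				/* plays dst_gc_task() */
{
	pthread_mutex_lock(&lock);
	b_done = 1;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
}

static void job_a(void)				/* plays cleanup_net() */
{
	pthread_mutex_lock(&lock);
	while (!b_done)				/* wait for references to drop */
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	puts("job A done");
}

static void (*queue[])(void) = { job_a, job_b };

static void *worker(void *arg)			/* the lone keventd-like thread */
{
	for (int i = 0; i < 2; i++)
		queue[i]();			/* A never returns: B is stuck behind it */
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	sleep(1);				/* worker is now stuck inside job_a() */
	job_b();				/* a second "workqueue" runs B; A completes */
	pthread_join(t, NULL);
	return 0;
}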
 






Re: [Devel] Re: [RFD] L2 Network namespace infrastructure

2007-06-28 Thread Kirill Korotaev
Ben Greear wrote:
 Kirill Korotaev wrote:
 
Patrick McHardy wrote:
  

I believe OpenVZ stores the current namespace somewhere global,
which avoids passing the namespace around. Couldn't you do this
as well?


yes, we store a global namespace context on current
(can be stored in per-cpu as well).

do you prefer this way?
  
 
 For what it's worth, I don't prefer this way as I can see wanting to 
 have one application
 use several namespaces at once

As I wrote to you in another email, it's not a problem,
since applications themselves do not own a network namespace.

It is objects like sockets which are bound to a net namespace, and thus
you can have sockets from different namespaces in one application
regardless of the context-handling mechanism we are talking about.

Thanks,
Kirill



Re: [RFD] L2 Network namespace infrastructure

2007-06-28 Thread Kirill Korotaev
Eric W. Biederman wrote:
 Patrick McHardy [EMAIL PROTECTED] writes:
 
 
Eric W. Biederman wrote:

-- The basic design

There will be a network namespace structure that holds the global
variables for a network namespace, making those global variables
per network namespace.

One of those per network namespace global variables will be the
loopback device.  Which means the network namespace a packet resides
in can be found simply by examining the network device or the socket
the packet is traversing.

Either a pointer to this global structure will be passed into
the functions that need to reference per network namespace variables
or a structure that is already passed in (such as the network device)
will be modified to contain a pointer to the network namespace
structure.
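
To picture that lookup rule, a small sketch (the field names are
illustrative assumptions, not taken from the patchset):

/* Per-namespace global state. */
struct net;

/* Both objects a packet traverses point back to their namespace, so the
 * namespace a packet resides in follows from either the device it
 * arrived on or the socket it belongs to: */
struct net_device { struct net *nd_net; /* ... */ };
struct sock       { struct net *sk_net; /* ... */ };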


I believe OpenVZ stores the current namespace somewhere global,
which avoids passing the namespace around. Couldn't you do this
as well?
 
 
 It sucks.  Especially in the corner cases.   Think macvlan
 with the real network device in one namespace and the ``vlan''
 device in another namespace.

sorry, what's the problem here? I don't see a single problem here
related to the global current context.
 
 The implementation of a global is also a little questionable.
 Last I looked it didn't work on the transmit path at all and
 interesting on the receive path.
 
 Further and fundamentally all a global achieves is removing the need
 for the noise patches where you pass the pointer into the various
 functions.  For long term maintenance it doesn't help anything.

this is not 100% true.

Having a global context also helps:
1. to account time spent in network processing or some other activity
   (like IRQ handling) to the appropriate namespace.

2. it is really useful and helpful for functions like printk() with a
   virtualized syslog buffer. Otherwise you would need to find a context
   in every place where a namespace-private message should be printed.

   Actually it is as natural as 'current', which is global,
   but could instead be passed from entry.S to every function which needs it.

   OVZ experience tells me that this helps to avoid all this noise
   and is not harder to work with than 'current'. But it's up to you.

Thanks,
Kirill
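
A tiny userspace sketch of the two styles under discussion (all names
illustrative, nothing here is from either patchset); a thread-local
variable stands in for 'current':

#include <stdio.h>

struct net { int sysctl_somevar; };		/* per-namespace "globals" */
struct task { struct net *net_ns; };		/* the owning context */

static _Thread_local struct task *current_task;	/* stand-in for 'current' */

/* Style 1: explicit argument - every function takes the namespace. */
static int read_sysctl_explicit(struct net *net)
{
	return net->sysctl_somevar;
}

/* Style 2: implicit context - fetched from 'current', OpenVZ-style. */
static int read_sysctl_implicit(void)
{
	return current_task->net_ns->sysctl_somevar;
}

int main(void)
{
	struct net ns = { .sysctl_somevar = 42 };
	struct task t = { .net_ns = &ns };

	current_task = &t;
	printf("explicit: %d, implicit: %d\n",
	       read_sysctl_explicit(&ns), read_sysctl_implicit());
	return 0;
}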


Re: [Devel] Re: [RFD] L2 Network namespace infrastructure

2007-06-27 Thread Kirill Korotaev
Patrick McHardy wrote:
 Eric W. Biederman wrote:
 
-- The basic design

There will be a network namespace structure that holds the global
variables for a network namespace, making those global variables
per network namespace.

One of those per network namespace global variables will be the
loopback device.  Which means the network namespace a packet resides
in can be found simply by examining the network device or the socket
the packet is traversing.

Either a pointer to this global structure will be passed into
the functions that need to reference per network namespace variables
or a structure that is already passed in (such as the network device)
will be modified to contain a pointer to the network namespace
structure.
 
 
 
 I believe OpenVZ stores the current namespace somewhere global,
 which avoids passing the namespace around. Couldn't you do this
 as well?

yes, we store a global namespace context on current
(can be stored in per-cpu as well).

do you prefer this way?

Thanks,
Kirill



Re: [Devel] Re: [RFD] L2 Network namespace infrastructure

2007-06-27 Thread Kirill Korotaev
Jeff Garzik wrote:
 Eric W. Biederman wrote:
 
Jeff Garzik [EMAIL PROTECTED] writes:


David Miller wrote:

I don't accept that we have to add another function argument
to a bunch of core routines just to support this crap,
especially since you give no way to turn it off and get
that function argument slot back.

To be honest I think this form of virtualization is a complete
waste of time, even the openvz approach.

We're protecting the kernel from itself, and that's an endless
uphill battle that you will never win.  Let's do this kind of
stuff properly with a real minimal hypervisor, hopefully with
appropriate hardware level support and good virtualized device
interfaces, instead of this namespace stuff.

Strongly seconded.  This containerized virtualization approach just bloats up
the kernel for something that is inherently fragile and IMO less secure --
protecting the kernel from itself.

Plenty of other virt approaches don't stir the code like this, while
simultaneously providing fewer, cleaner entry points for the
virtualization to occur.

Wrong.  I really don't want to get into a "my virtualization approach is
better than yours" debate.  But this is flat out wrong.
 
 
99% of the changes I'm talking about introducing are just:
- variable 
+ ptr->variable

There are more pieces mostly with when we initialize those variables but
that is the essence of the change.
 
 
 You completely dodged the main objection.  Which is OK if you are 
 selling something to marketing departments, but not OK

It is a pure illusion that one kind of virtualization is better than the
other.
Look, maybe *hardware* virtualization and the small size of the hypervisor
make you feel safer; however, you totally forget about all the emulation
drivers etc., which can have bugs and security implications as well, and
may be triggerable from inside the VM.

 Containers introduce chroot-jail-like features that give one a false 
 sense of security, while still requiring one to poke holes in the 
 illusion to get hardware-specific tasks accomplished.

The concept of users in any OS (and in Linux in particular) gives people
a false sense of security as well!
I know a bunch of examples of how one user can crash/DoS/abuse another one,
e.g. on RHEL5 or mainstream.
But no one has problems with that, and no one thinks that multiuserness
is a step backward - or maybe someone does?!

Yes, there are always bugs and sometimes security implications,
but people live fine with it once it is fixed.

 The capable/not-capable model (i.e. superuser / normal user) is _still_ 
 being secured locally, even after decades of work and whitepapers and 
 audits.

 You are drinking Deep Kool-Aid if you think adding containers to the 
 myriad kernel subsystems does anything besides increasing fragility, and 
 decreasing security.  You are securing in-kernel subsystems against 
 other in-kernel subsystems.  superuser/user model made that difficult 
 enough... now containers add exponential audit complexity to that.  Who 
 is to say that a local root does not also pierce the container model?

containers do only one thing:
they make sure that objects from one context are not visible to another one.

If containers are not used, everything stays as it is now, i.e.
everything is global and globally visible, and there is no auditing
overhead at all. So if you are not interested in containers, the code
auditing won't be noticeably harder for you.

And as opposed to other virtualization approaches so far no one has been
able to measure the overhead.  I suspect there will be a few more cache
line misses somewhere but they haven't shown up yet.

If the only use was strong isolation which Dave complains about I would
concur that the namespace approach is inappropriate.  However there are
a lot other uses.
 
 
 Sure there are uses.  There are uses to putting the X server into the 
 kernel, too.  At some point complexity and featuritis has to take a back 
 seat to basic sanity.

I agree about sanity, but I totally disagree about the complexity
you talk about.
The bugs we face/fix in the Linux kernel that are found with the help of
this kind of virtualization make me believe that the Linux kernel
only wins from it.

Thanks,
Kirill



Re: [Devel] Re: [PATCH] Virtual ethernet tunnel

2007-06-07 Thread Kirill Korotaev
David Miller wrote:
 From: Pavel Emelianov [EMAIL PROTECTED]
 Date: Wed, 06 Jun 2007 19:11:38 +0400
 
 
Veth stands for Virtual ETHernet. It is a simple tunnel driver
that works at the link layer and looks like a pair of ethernet
devices interconnected with each other.
 
 
 I would suggest choosing a different name.
 
 'veth' is also the name of the virtualized ethernet device
 found on IBM machines, driven by driver/net/ibmveth.[ch]

AFAICS, ibmveth.c registers ethX devices, while this driver registers
vethX by default, so there is not much conflict IMHO.

But if you still don't like it, I would suggest 'ethtun', and later
we will come up with 'iptun' for IP tunneling. Is that ok for you?

Thanks,
Kirill



Re: [PATCH] Virtual ethernet tunnel

2007-06-07 Thread Kirill Korotaev
Daniel,

Daniel Lezcano wrote:
 Pavel Emelianov wrote:
 
I did this at the very first version, but Alexey showed me that this
would be wrong. Look. When we create the second device it must be in
the other namespace as it is useless to have them in one namespace.
But if we have the device in the other namespace, the RTM_NEWLINK
message from the kernel would come into this namespace, thus confusing the
ip utility in the init namespace. Creating the device in the init ns and
moving it into the new one is rather a complex task.
  
  

Pavel,

moving the netdevice to another namespace is not a complex task. Eric
Biederman did it in its patchset ( cf.  http://lxc.sf.net/network )


By saying complex I didn't mean that this is difficult to implement,
but that it consists (must consist) of many stages, i.e. it is composite.
Creating the device right in the namespace is lighter.

  

When the pair device is created, both extremities are in the init
namespace and you can choose to which namespace to move one extremity.


I do not mind that.
  

When the network namespace dies, the netdev is moved back to the init
namespace.
That facilitates network device management.

Concerning netlink events, these are automatically generated when the
network device is moved between namespaces.

IMHO, we should have network device movement between namespaces in
order to be able to move a physical network device too (e.g. you have 4
NICs and you want to create 3 containers and assign 3 of the NICs to them)


Agree. Moving the devices is a must-have functionality.

I do not mind making the pair in the init namespace and moving the second
one into the desired namespace. But if we will *always* have the two ends
in different namespaces, what do we complicate things for?
  
 
 Just to provide a netdev sufficiently generic to be used by people who 
 don't want namespaces but just want to do some network testing, like Ben 
 Greear does. He mentioned in a previous email that he will be happy to stop 
 redirecting people to an out-of-tree patch.
 
 https://lists.linux-foundation.org/pipermail/containers/2007-April/004420.html

no one is against generic code and the ability to create 2 interfaces in
*one* namespace.
(Like we currently allow in OpenVZ.)

However, believe me, moving an interface is a *hard* operation. Much harder
than registering a netdev from scratch.

It requires taking into account many things like:
- packets in flight, which require synchronization, and that is slow on
  big machines
- asynchronous sysfs entry registration/deregistration from
  rtnl_unlock -> netdev_run_todo
- name/ifindex collisions
- shutdown/cleanup of addresses/routes/qdisc and other similar stuff

Thanks,
Kirill



Re: [PATCH 14/59] sysctl: C99 convert xfs ctl_tables

2007-01-17 Thread Kirill Korotaev
minor extra space in table below...

Kirill

 From: Eric W. Biederman [EMAIL PROTECTED] - unquoted
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
 ---
  fs/xfs/linux-2.6/xfs_sysctl.c |  258 
  1 files changed, 180 insertions(+), 78 deletions(-)
 
 diff --git a/fs/xfs/linux-2.6/xfs_sysctl.c b/fs/xfs/linux-2.6/xfs_sysctl.c
 index af777e9..5a0eefc 100644
 --- a/fs/xfs/linux-2.6/xfs_sysctl.c
 +++ b/fs/xfs/linux-2.6/xfs_sysctl.c
 @@ -55,95 +55,197 @@ xfs_stats_clear_proc_handler(
  #endif /* CONFIG_PROC_FS */
  
  STATIC ctl_table xfs_table[] = {
 - {XFS_RESTRICT_CHOWN, "restrict_chown", &xfs_params.restrict_chown.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.restrict_chown.min, &xfs_params.restrict_chown.max},
 -
 - {XFS_SGID_INHERIT, "irix_sgid_inherit", &xfs_params.sgid_inherit.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.sgid_inherit.min, &xfs_params.sgid_inherit.max},
 -
 - {XFS_SYMLINK_MODE, "irix_symlink_mode", &xfs_params.symlink_mode.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.symlink_mode.min, &xfs_params.symlink_mode.max},
 -
 - {XFS_PANIC_MASK, "panic_mask", &xfs_params.panic_mask.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.panic_mask.min, &xfs_params.panic_mask.max},
 -
 - {XFS_ERRLEVEL, "error_level", &xfs_params.error_level.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.error_level.min, &xfs_params.error_level.max},
 -
 - {XFS_SYNCD_TIMER, "xfssyncd_centisecs", &xfs_params.syncd_timer.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.syncd_timer.min, &xfs_params.syncd_timer.max},
 -
 - {XFS_INHERIT_SYNC, "inherit_sync", &xfs_params.inherit_sync.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.inherit_sync.min, &xfs_params.inherit_sync.max},
 -
 - {XFS_INHERIT_NODUMP, "inherit_nodump", &xfs_params.inherit_nodump.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.inherit_nodump.min, &xfs_params.inherit_nodump.max},
 -
 - {XFS_INHERIT_NOATIME, "inherit_noatime", &xfs_params.inherit_noatim.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.inherit_noatim.min, &xfs_params.inherit_noatim.max},
 -
 - {XFS_BUF_TIMER, "xfsbufd_centisecs", &xfs_params.xfs_buf_timer.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.xfs_buf_timer.min, &xfs_params.xfs_buf_timer.max},
 -
 - {XFS_BUF_AGE, "age_buffer_centisecs", &xfs_params.xfs_buf_age.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.xfs_buf_age.min, &xfs_params.xfs_buf_age.max},
 -
 - {XFS_INHERIT_NOSYM, "inherit_nosymlinks", &xfs_params.inherit_nosym.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.inherit_nosym.min, &xfs_params.inherit_nosym.max},
 -
 - {XFS_ROTORSTEP, "rotorstep", &xfs_params.rotorstep.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.rotorstep.min, &xfs_params.rotorstep.max},
 -
 - {XFS_INHERIT_NODFRG, "inherit_nodefrag", &xfs_params.inherit_nodfrg.val,
 - sizeof(int), 0644, NULL, &proc_dointvec_minmax,
 - &sysctl_intvec, NULL,
 - &xfs_params.inherit_nodfrg.min, &xfs_params.inherit_nodfrg.max},
 + {
 + .ctl_name   = XFS_RESTRICT_CHOWN,
 + .procname   = "restrict_chown",
 + .data   = &xfs_params.restrict_chown.val,
 + .maxlen = sizeof(int),
 + .mode   = 0644,
 + .proc_handler   = &proc_dointvec_minmax,
 + .strategy   = &sysctl_intvec,
 + .extra1 = &xfs_params.restrict_chown.min,
 + .extra2 = &xfs_params.restrict_chown.max
 + },
 + {
 + .ctl_name   = XFS_SGID_INHERIT,
 + .procname   = "irix_sgid_inherit",
 + .data   = &xfs_params.sgid_inherit.val,
 + .maxlen = sizeof(int),
 + .mode   = 0644,
 + .proc_handler   = &proc_dointvec_minmax,
 + .strategy   = &sysctl_intvec,
 + .extra1 = &xfs_params.sgid_inherit.min,
 + .extra2 = &xfs_params.sgid_inherit.max
 + },
 + {
 + .ctl_name   = XFS_SYMLINK_MODE,
 + .procname   = "irix_symlink_mode",
 + .data   = &xfs_params.symlink_mode.val,
 + .maxlen = sizeof(int),
 + .mode   = 0644,
 + .proc_handler   = 

Re: [PATCH 25/59] sysctl: C99 convert arch/frv/kernel/pm.c

2007-01-17 Thread Kirill Korotaev
another small minor note.

 From: Eric W. Biederman [EMAIL PROTECTED] - unquoted
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
 ---
  arch/frv/kernel/pm.c |   50 
 +++---
  1 files changed, 43 insertions(+), 7 deletions(-)
 
 diff --git a/arch/frv/kernel/pm.c b/arch/frv/kernel/pm.c
 index c1840d6..aa50333 100644
 --- a/arch/frv/kernel/pm.c
 +++ b/arch/frv/kernel/pm.c
 @@ -401,17 +401,53 @@ static int cm_sysctl(ctl_table *table, int __user 
 *name, int nlen,
  
  static struct ctl_table pm_table[] =
  {
 - {CTL_PM_SUSPEND, "suspend", NULL, 0, 0200, NULL, &sysctl_pm_do_suspend},
 - {CTL_PM_CMODE, "cmode", &clock_cmode_current, sizeof(int), 0644, NULL, 
  &cmode_procctl, &cmode_sysctl, NULL},
 - {CTL_PM_P0, "p0", &clock_p0_current, sizeof(int), 0644, NULL, 
  &p0_procctl, &p0_sysctl, NULL},
 - {CTL_PM_CM, "cm", &clock_cm_current, sizeof(int), 0644, NULL, 
  &cm_procctl, &cm_sysctl, NULL},
 - {0}
 + {
 + .ctl_name   = CTL_PM_SUSPEND,
 + .procname   = "suspend",
 + .data   = NULL,
 + .maxlen = 0,
 + .mode   = 0200,
 + .proc_handler   = &sysctl_pm_do_suspend,
 + },
 + {
 + .ctl_name   = CTL_PM_CMODE,
 + .procname   = "cmode",
 + .data   = &clock_cmode_current,
 + .maxlen = sizeof(int),
 + .mode   = 0644,
 + .proc_handler   = &cmode_procctl,
 + .strategy   = &cmode_sysctl,
 + },
 + {
 + .ctl_name   = CTL_PM_P0,
 + .procname   = "p0",
 + .data   = &clock_p0_current,
 + .maxlen = sizeof(int),
 + .mode   = 0644,
 + .proc_handler   = &p0_procctl,
 + .strategy   = &p0_sysctl,
 + },
 + {
 + .ctl_name   = CTL_PM_CM,
 + .procname   = "cm",
 + .data   = &clock_cm_current,
 + .maxlen = sizeof(int),
 + .mode   = 0644,
 + .proc_handler   = &cm_procctl,
 + .strategy   = &cm_sysctl,
 + },
 + { .ctl_name = 0}
in the next patch (26/59) you write just { }; .ctl_name = 0 is not required here.


  };
  
  static struct ctl_table pm_dir_table[] =
  {
 - {CTL_PM, "pm", NULL, 0, 0555, pm_table},
 - {0}
 + {
 + .ctl_name   = CTL_PM,
 + .procname   = "pm",
 + .mode   = 0555,
 + .child  = pm_table,
 + },
 + { .ctl_name = 0}
  };
  
  /*
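
On the { } vs { .ctl_name = 0 } terminator note above, a standalone check
(illustrative, not from the patch): C99 designated initializers zero every
member that is not named, and the kernel's { } is gcc's spelling of the
same all-zero entry, so the two terminators are identical:

#include <stdio.h>

struct ctl_like { int ctl_name; const char *procname; int mode; };

int main(void)
{
	struct ctl_like a = { .ctl_name = 0 };	/* explicit terminator */
	struct ctl_like b = { 0 };		/* plain zero initializer */

	printf("identical: %d\n",
	       a.ctl_name == b.ctl_name &&
	       a.procname == b.procname && a.mode == b.mode);
	return 0;
}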



Re: [PATCH 33/59] sysctl: s390 move sysctl definitions to sysctl.h

2007-01-17 Thread Kirill Korotaev
IDs not sorted in enum. see below.

 From: Eric W. Biederman [EMAIL PROTECTED] - unquoted
 
 We need to have the definition of all top-level sysctl
 directories registered in sysctl.h so we don't conflict by
 accident and cause ABI problems.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
 ---
  arch/s390/appldata/appldata.h |3 +--
  arch/s390/kernel/debug.c  |1 -
  arch/s390/mm/cmm.c|4 
  include/linux/sysctl.h|7 +++
  4 files changed, 8 insertions(+), 7 deletions(-)
 
 diff --git a/arch/s390/appldata/appldata.h b/arch/s390/appldata/appldata.h
 index 0429481..4069b81 100644
 --- a/arch/s390/appldata/appldata.h
 +++ b/arch/s390/appldata/appldata.h
 @@ -21,8 +21,7 @@
 #define APPLDATA_RECORD_NET_SUM_ID   0x03/* must be < 256 ! */
  #define APPLDATA_RECORD_PROC_ID  0x04
  
 -#define CTL_APPLDATA 2120/* sysctl IDs, must be unique */
 -#define CTL_APPLDATA_TIMER   2121
 +#define CTL_APPLDATA_TIMER   2121/* sysctl IDs, must be unique */
  #define CTL_APPLDATA_INTERVAL2122
  #define CTL_APPLDATA_MEM 2123
  #define CTL_APPLDATA_OS  2124
 diff --git a/arch/s390/kernel/debug.c b/arch/s390/kernel/debug.c
 index bb57bc0..c81f8e5 100644
 --- a/arch/s390/kernel/debug.c
 +++ b/arch/s390/kernel/debug.c
 @@ -852,7 +852,6 @@ debug_finish_entry(debug_info_t * id, debug_entry_t* 
 active, int level,
  static int debug_stoppable=1;
  static int debug_active=1;
  
 -#define CTL_S390DBF 5677
  #define CTL_S390DBF_STOPPABLE 5678
  #define CTL_S390DBF_ACTIVE 5679
  
 diff --git a/arch/s390/mm/cmm.c b/arch/s390/mm/cmm.c
 index 607f50e..df733d5 100644
 --- a/arch/s390/mm/cmm.c
 +++ b/arch/s390/mm/cmm.c
 @@ -256,10 +256,6 @@ cmm_skip_blanks(char *cp, char **endp)
  }
  
  #ifdef CONFIG_CMM_PROC
 -/* These will someday get removed. */
 -#define VM_CMM_PAGES 1111
 -#define VM_CMM_TIMED_PAGES   1112
 -#define VM_CMM_TIMEOUT   1113
  
  static struct ctl_table cmm_table[];
  
 diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
 index 71c16b4..56d0161 100644
 --- a/include/linux/sysctl.h
 +++ b/include/linux/sysctl.h
 @@ -73,6 +73,8 @@ enum
   CTL_SUNRPC=7249,/* sunrpc debug */
   CTL_PM=9899,/* frv power management */
   CTL_FRV=9898,   /* frv specific sysctls */
 + CTL_S390DBF=5677,   /* s390 debug */
 + CTL_APPLDATA=2120,  /* s390 appldata */
 not sorted by ID? imho they should be sorted, otherwise a conflict can go
unnoticed when a new ID is inserted above.

  };
  
  /* CTL_BUS names: */
 @@ -205,6 +207,11 @@ enum
   VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
   VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
   VM_MIN_SLAB=35,  /* Percent pages ignored by zone reclaim */
 +
 + /* s390 vm cmm sysctls */
 + VM_CMM_PAGES=1111,
 + VM_CMM_TIMED_PAGES=1112,
 + VM_CMM_TIMEOUT=1113,
  };
  
  



Re: [PATCH 50/59] sysctl: Move utsname sysctls to their own file

2007-01-17 Thread Kirill Korotaev
Eric, though I personally don't care much:
1. I ask you not to set your authorship/copyright on code which you just
   copied from other places. It just doesn't look polite IMHO.
2. I would propose not to introduce utsname_sysctl.c.
   Both files are so small and minor that I can't see much reason for
   splitting them.

Kirill

 From: Eric W. Biederman [EMAIL PROTECTED] - unquoted
 
 This is just a simple cleanup to keep kernel/sysctl.c
 from getting too crowded with special cases; by
 keeping all of the utsname logic together it makes
 the code a little more readable.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
 ---
  kernel/Makefile |1 +
  kernel/sysctl.c |  115 -
  kernel/utsname_sysctl.c |  146 
 +++
  3 files changed, 147 insertions(+), 115 deletions(-)
 
 diff --git a/kernel/Makefile b/kernel/Makefile
 index 14f4d45..d286c44 100644
 --- a/kernel/Makefile
 +++ b/kernel/Makefile
 @@ -48,6 +48,7 @@ obj-$(CONFIG_SECCOMP) += seccomp.o
  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
  obj-$(CONFIG_RELAY) += relay.o
  obj-$(CONFIG_UTS_NS) += utsname.o
 +obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
  obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
  obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
  
 diff --git a/kernel/sysctl.c b/kernel/sysctl.c
 index 7420761..a8c0a03 100644
 --- a/kernel/sysctl.c
 +++ b/kernel/sysctl.c
 @@ -135,13 +135,6 @@ static int parse_table(int __user *, int, void __user *, 
 size_t __user *,
   void __user *, size_t, ctl_table *);
  #endif
  
 -static int proc_do_uts_string(ctl_table *table, int write, struct file *filp,
 -   void __user *buffer, size_t *lenp, loff_t *ppos);
 -
 -static int sysctl_uts_string(ctl_table *table, int __user *name, int nlen,
 -   void __user *oldval, size_t __user *oldlenp,
 -   void __user *newval, size_t newlen);
 -
  #ifdef CONFIG_SYSVIPC
  static int sysctl_ipc_data(ctl_table *table, int __user *name, int nlen,
 void __user *oldval, size_t __user *oldlenp,
 @@ -174,27 +167,6 @@ extern ctl_table inotify_table[];
  int sysctl_legacy_va_layout;
  #endif
  
 -static void *get_uts(ctl_table *table, int write)
 -{
 - char *which = table->data;
 -#ifdef CONFIG_UTS_NS
 - struct uts_namespace *uts_ns = current->nsproxy->uts_ns;
 - which = (which - (char *)&init_uts_ns) + (char *)uts_ns;
 -#endif
 - if (!write)
 - down_read(&uts_sem);
 - else
 - down_write(&uts_sem);
 - return which;
 -}
 -
 -static void put_uts(ctl_table *table, int write, void *which)
 -{
 - if (!write)
 - up_read(&uts_sem);
 - else
 - up_write(&uts_sem);
 -}
  
  #ifdef CONFIG_SYSVIPC
  static void *get_ipc(ctl_table *table, int write)
 @@ -275,51 +247,6 @@ static ctl_table root_table[] = {
  
  static ctl_table kern_table[] = {
   {
 - .ctl_name   = KERN_OSTYPE,
 - .procname   = "ostype",
 - .data   = init_uts_ns.name.sysname,
 - .maxlen = sizeof(init_uts_ns.name.sysname),
 - .mode   = 0444,
 - .proc_handler   = &proc_do_uts_string,
 - .strategy   = &sysctl_uts_string,
 - },
 - {
 - .ctl_name   = KERN_OSRELEASE,
 - .procname   = "osrelease",
 - .data   = init_uts_ns.name.release,
 - .maxlen = sizeof(init_uts_ns.name.release),
 - .mode   = 0444,
 - .proc_handler   = &proc_do_uts_string,
 - .strategy   = &sysctl_uts_string,
 - },
 - {
 - .ctl_name   = KERN_VERSION,
 - .procname   = "version",
 - .data   = init_uts_ns.name.version,
 - .maxlen = sizeof(init_uts_ns.name.version),
 - .mode   = 0444,
 - .proc_handler   = &proc_do_uts_string,
 - .strategy   = &sysctl_uts_string,
 - },
 - {
 - .ctl_name   = KERN_NODENAME,
 - .procname   = "hostname",
 - .data   = init_uts_ns.name.nodename,
 - .maxlen = sizeof(init_uts_ns.name.nodename),
 - .mode   = 0644,
 - .proc_handler   = &proc_do_uts_string,
 - .strategy   = &sysctl_uts_string,
 - },
 - {
 - .ctl_name   = KERN_DOMAINNAME,
 - .procname   = "domainname",
 - .data   = init_uts_ns.name.domainname,
 - .maxlen = sizeof(init_uts_ns.name.domainname),
 - .mode   = 0644,
 - .proc_handler   = &proc_do_uts_string,
 - .strategy   = &sysctl_uts_string,
 - },
 - {
  .ctl_name   = KERN_PANIC,
  .procname   = "panic",
  .data   = &panic_timeout,
 @@ 

Re: [PATCH 51/59] sysctl: Move SYSV IPC sysctls to their own file

2007-01-17 Thread Kirill Korotaev
1. I ask you not to set your authorship/copyright on code which you just
   copied from other places. It just doesn't look polite IMHO.
2. please don't name files like ipc/ipc_sysctl.c;
   ipc/sysctl.c sounds better IMHO.
3. any reason to introduce CONFIG_SYSVIPC_SYSCTL?
   why not simply do
+obj-$(CONFIG_SYSCTL) += sysctl.o
   instead?

Kirill

 From: Eric W. Biederman [EMAIL PROTECTED] - unquoted
 
 This is just a simple cleanup to keep kernel/sysctl.c
 from getting too crowded with special cases; by
 keeping all of the ipc logic together it makes
 the code a little more readable.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
 ---
  init/Kconfig |6 ++
  ipc/Makefile |1 +
  ipc/ipc_sysctl.c |  182 
 ++
  kernel/sysctl.c  |  174 ---
  4 files changed, 189 insertions(+), 174 deletions(-)
 
 diff --git a/init/Kconfig b/init/Kconfig
 index a3f83e2..33bc38d 100644
 --- a/init/Kconfig
 +++ b/init/Kconfig
 @@ -116,6 +116,12 @@ config SYSVIPC
 section 6.4 of the Linux Programmer's Guide, available from
 http://www.tldp.org/guides.html.
  
 +config SYSVIPC_SYSCTL
 + bool
 + depends on SYSVIPC
 + depends on SYSCTL
 + default y
 +
  config IPC_NS
   bool IPC Namespaces
   depends on SYSVIPC
 diff --git a/ipc/Makefile b/ipc/Makefile
 index 0a6d626..b93bba6 100644
 --- a/ipc/Makefile
 +++ b/ipc/Makefile
 @@ -4,6 +4,7 @@
  
  obj-$(CONFIG_SYSVIPC_COMPAT) += compat.o
  obj-$(CONFIG_SYSVIPC) += util.o msgutil.o msg.o sem.o shm.o
 +obj-$(CONFIG_SYSVIPC_SYSCTL) += ipc_sysctl.o
  obj_mq-$(CONFIG_COMPAT) += compat_mq.o
  obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
  
 diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
 new file mode 100644
 index 000..9018009
 --- /dev/null
 +++ b/ipc/ipc_sysctl.c
 @@ -0,0 +1,182 @@
 +/*
 + *  Copyright (C) 2007
 + *
 + *  Author: Eric Biederman [EMAIL PROTECTED]
 + *
 + *  This program is free software; you can redistribute it and/or
 + *  modify it under the terms of the GNU General Public License as
 + *  published by the Free Software Foundation, version 2 of the
 + *  License.
 + */
 +
 +#include <linux/module.h>
 +#include <linux/ipc.h>
 +#include <linux/nsproxy.h>
 +#include <linux/sysctl.h>
 +
 +#ifdef CONFIG_IPC_NS
 +static void *get_ipc(ctl_table *table)
 +{
 + char *which = table->data;
 + struct ipc_namespace *ipc_ns = current->nsproxy->ipc_ns;
 + which = (which - (char *)&init_ipc_ns) + (char *)ipc_ns;
 + return which;
 +}
 +#else
 +#define get_ipc(T) ((T)->data)
 +#endif
 +
 +#ifdef CONFIG_PROC_FS
 +static int proc_ipc_dointvec(ctl_table *table, int write, struct file *filp,
 + void __user *buffer, size_t *lenp, loff_t *ppos)
 +{
 + struct ctl_table ipc_table;
 + memcpy(&ipc_table, table, sizeof(ipc_table));
 + ipc_table.data = get_ipc(table);
 +
 + return proc_dointvec(&ipc_table, write, filp, buffer, lenp, ppos);
 +}
 +
 +static int proc_ipc_doulongvec_minmax(ctl_table *table, int write,
 + struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos)
 +{
 + struct ctl_table ipc_table;
 + memcpy(&ipc_table, table, sizeof(ipc_table));
 + ipc_table.data = get_ipc(table);
 +
 + return proc_doulongvec_minmax(&ipc_table, write, filp, buffer,
 + lenp, ppos);
 +}
 +
 +#else
 +#define proc_ipc_do_ulongvec_minmax NULL
 +#define proc_ipc_do_intvec   NULL
 +#endif
 +
 +#ifdef CONFIG_SYSCTL_SYSCALL
 +/* The generic sysctl ipc data routine. */
 +static int sysctl_ipc_data(ctl_table *table, int __user *name, int nlen,
 + void __user *oldval, size_t __user *oldlenp,
 + void __user *newval, size_t newlen)
 +{
 + size_t len;
 + void *data;
 +
 + /* Get out of I don't have a variable */
 + if (!table->data || !table->maxlen)
 + return -ENOTDIR;
 +
 + data = get_ipc(table);
 + if (!data)
 + return -ENOTDIR;
 +
 + if (oldval && oldlenp) {
 + if (get_user(len, oldlenp))
 + return -EFAULT;
 + if (len) {
 + if (len > table->maxlen)
 + len = table->maxlen;
 + if (copy_to_user(oldval, data, len))
 + return -EFAULT;
 + if (put_user(len, oldlenp))
 + return -EFAULT;
 + }
 + }
 +
 + if (newval && newlen) {
 + if (newlen > table->maxlen)
 + newlen = table->maxlen;
 +
 + if (copy_from_user(data, newval, newlen))
 + return -EFAULT;
 + }
 + return 1;
 +}
 +#else
 +#define sysctl_ipc_data NULL
 +#endif
 +
 +static struct ctl_table ipc_kern_table[] = {
 + {
 + .ctl_name   = KERN_SHMMAX,
 + .procname   = "shmmax",
 + .data   = 

Re: [PATCH 0/59] Cleanup sysctl

2007-01-17 Thread Kirill Korotaev
Eric, really good job!

Patches: 1-13, 15-24, 26-32, 34-44, 46-49, 52-55, 57 (all except below)
Acked-By: Kirill Korotaev [EMAIL PROTECTED]

14/59 - minor (extra space)
25/59 - minor note  
33/59 - not sorted sysctl IDs
45/59 - typo
50/59 - copyright/file note
51/59 - copyright/file name/kconfig option notes

56,58,59/59 - will review tomorrow

another issue I have to think over is the removal of de->owner.
Alexey Dobriyan has recently sent patches fixing /proc <-> module refcounting.
I guess w/o those patches your changes are not safe if proc_handler or strategy
are functions from a module.

Thanks,
Kirill

 There has not been much maintenance on sysctl in years, and as a result
 there is a lot to do to allow future interesting work to happen, and being
 ambitious I'm trying to do it all at once :)
 
 The patches in this series fall into several general categories.
 
 - Removal of useless attempts to override the standard sysctls
 
 - Registration of sysctl numbers in sysctl.h so someone else does not use
   the magic number and conflict.
 
 - C99 conversions so it becomes possible to change the layout of 
   struct ctl_table without breaking everything.
 
 - Removal of useless claims of module ownership, in the proc dir entries
 
 - Removal of sys_sysctl support where people had used conflicting sysctl
   numbers. Trying to break glibc or other applications by changing the
   ABI is not cool.  9 instances of this in the kernel seems a little
   extreme.
 
 - General enhancements when I got the junk I could see out.
 
 Odds are I missed something, most of the cleanups are simply a result of
 me working on the sysctl core and glancing at the users and going: What?
 
 Eric


Re: [Devel] Re: Network virtualization/isolation

2006-12-06 Thread Kirill Korotaev
If there is a better and less intrusive while still being obvious
method I am all for it.  I do not like the OpenVZ thing of doing the
lookup once, then stashing the value in current and special-casing
the exceptions.

Why?
 
 
 I like it when things are obvious and not implied.
 
 The implementations seem to favor fewer lines of code touched over
 maintainability of the code.  Which, if you are maintaining out-of-tree
 code, is fine.  At least that was my impression last time
 I looked at the code.
FYI, when we started doing networking virtualization many years ago
we tried both approaches.
Over time, the context notion looked much more natural and easier for us.
Even Alexey Kuznetsov says that he prefers exec_env, as the logic
becomes very clear and little mess is introduced.

 I know there are a lot of silly things in the existing implementations
 because they were initially written without the expectation of being
 able to merge the code into the main kernel.  This resulted in some
 non-general interfaces, and a preference for patches that touch
 as few lines of code as possible.  
Sure, but the OpenVZ code is constantly being cleaned of such code
and we are open for discussion. No one pretends that the code is perfect
from the beginning.

 Anyway this has bit has been discussed before and we can discuss it
 seriously in the context of patch review.
Let me explain why an explicit context like exec_env is IMHO cleaner:
- context is a natural notion of the linux kernel, e.g. current.
  Why not pass 'current' to all the functions as an argument
  starting from entry.S?
  in_atomic(), in_interrupt() etc. - all these functions deal with the
  current context.
  IMHO when one needs to pass an argument like 'current' too many times,
  it is better to use the notion of a context.
- e.g. NFS should set the networking context of the mount point or socket.
But, ok, imho it is not worth arguing so much over and wasting our time
instead of doing things.

Thanks,
Kirill



Re: [Devel] Re: [RFC] network namespaces

2006-09-07 Thread Kirill Korotaev

Herbert Poetzl wrote:


my point (until we have an implementation which clearly
shows that performance is equal/better to isolation)
is simply this:

of course, you can 'simulate' or 'construct' all the
isolation scenarios with kernel bridging and routing
and tricky injection/marking of packets, but, this
usually comes with an overhead ...
 


Well, TANSTAAFL*, and pretty much everything comes with an overhead. 
Multitasking comes with the (scheduler, context switch, CPU cache, etc.) 
overhead -- is that the reason to abandon it? OpenVZ and Linux-VServer 
resource management also adds some overhead -- do we want to throw it away?


The question is not just equal or better performance, the question is 
what do we get and how much we pay for it.



Equal or better performance is certainly required when we have the code
compiled in but aren't using it.  We must not penalize the current code.

you talk about host system performance.
Neither approach introduces overhead to host networking.

Finally, as I understand both network isolation and network 
virtualization (both level2 and level3) can happily co-exist. We do have 
several filesystems in kernel. Let's have several network virtualization 
approaches, and let a user choose. Does that make sense?



If there are not compelling arguments for using both ways of doing
it is silly to merge both, as it is more maintenance overhead.

That said I think there is a real chance if we can look at the bind
filtering and find a way to express that in the networking stack
through iptables.  Using the security hooks conflicts with things
like selinux.   Although it would be interesting to see if selinux
can already implement general purpose layer 3 filtering.

The more I look the gut feel I have is that the way to proceed would
be to add a new table that filters binds, and connects.  Plus a new
module that would look at a process creating a socket and tell us if
it is the appropriate group of processes.  With a little care that
would be a general solution to the layer 3 filtering problem.

Huh, you will still have to insert lots of access checks into different
parts of the code like RAW sockets, netlink, protocols which are not inserted,
netfilter (to not allow creating iptables rules :) ) and many many other places.

I see Dave Miller looking at such a patch and my ears hear his rude words :)

Thanks,
Kirill


Re: [RFC] network namespaces

2006-09-06 Thread Kirill Korotaev

On Tue, Sep 05, 2006 at 08:45:39AM -0600, Eric W. Biederman wrote:


Daniel Lezcano [EMAIL PROTECTED] writes:

For HPC if you are interested in migration you need a separate IP
per container. If you can take your IP address with you, migration of
networking state is simple. If you can't take your IP address with you
a network container is nearly pointless from a migration perspective.

Beyond that from everything I have seen layer 2 is just much cleaner
than any layer 3 approach short of Serge's bind filtering.


well, the 'ip subset' approach Linux-VServer and
other Jail solutions use is very clean, it just does
not match your expectations of a virtual interface
(as there is none) and it does not cope well with
all kinds of per context 'requirements', which IMHO
do not really exist on the application layer (only
on the whole system layer)



I probably expressed that wrong.  There are currently three
basic approaches under discussion.
Layer 3 (Basically bind filtering) nothing at the packet level.
   The approach taken by Serge's version of bsdjails and Vserver.

Layer 2.5 What Daniel proposed.

Layer 2.  (Trivially mapping each packet to a different interface)
   And then treating everything as multiple instances of the
   network stack.
Roughly what OpenVZ and I have implemented.

I think classifying network virtualization by Layer X is not good enough.
OpenVZ has Layer 3 (venet) and Layer 2 (veth) implementations, but
in both cases the networking stack inside a VE remains fully virtualized.

Thanks,
Kirill



Re: [RFC] network namespaces

2006-09-05 Thread Kirill Korotaev

Yes, performance is probably one issue.

My concern was about layer 2 / layer 3 virtualization. I agree a layer 2
isolation/virtualization is the best for the system container.
But there is another family of containers, called application containers:
it is not a system which is run inside a container but only the
application. If you want to run an oracle database inside a container,
you can run it inside an application container without launching init
and all the services.


This family of containers is also used for HPC (high performance
computing) and for distributed checkpoint/restart. The cluster runs
hundreds of jobs, spawning them on different hosts inside an application
container. Usually the jobs communicate with broadcast and multicast.
Application containers do not care about having different MAC addresses and
rely on a layer 3 approach.


Are application containers comfortable with a layer 2 virtualization ? I
don't think so, because several jobs running inside the same host
communicate via broadcast/multicast between them and between other jobs
running on different hosts. The IP consumption is a problem too: 1
container == 2 IPs (one for the root namespace / one for the container),
multiplied by the number of jobs. Furthermore, lots of jobs == lots
of virtual devices.


However, after a discussion with Kirill at the OLS, it appears we can 
merge the layer 2 and 3 approaches if the level of network 
virtualization is tunable and we can choose layer 2 or layer 3 when 
doing the unshare. The determination of the namespace for the incoming 
traffic can be done with a specific iptables module as a first step. 
While looking at the network namespace patches, it appears that the 
TCP/UDP part is **very** similar to what is needed for a layer 3 approach.


Any thoughts ?

My humble opinion is that your approach doesn't intersect with this one.
So we can freely go with both *if needed*,
and hear comments from the network guru guys on what to improve and how.

So I suggest you at least send the patches, so we could discuss them.

Thanks,
Kirill


[PATCH] neigh_table_clear() doesn't free stats

2006-09-01 Thread Kirill Korotaev

neigh_table_clear() doesn't free tbl-stats.
Found by Alexey Kuznetsov. Though Alexey considers this
leak minor for mainstream, I still believe that cleanup
code should not forget to free some of the resources :)

At least, this is critical for OpenVZ with virtualized
neighbour tables.

Signed-Off-By: Kirill Korotaev [EMAIL PROTECTED]


diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 89b7904..a45bd21 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1429,6 +1429,9 @@ int neigh_table_clear(struct neigh_table
kfree(tbl->phash_buckets);
tbl->phash_buckets = NULL;

+   free_percpu(tbl->stats);
+   tbl->stats = NULL;
+
return 0;
}



Re: [PATCH 3/9] network namespaces: playing and debugging

2006-08-17 Thread Kirill Korotaev

Temporary code to play with network namespaces in the simplest way.
Do
   exec 7< /proc/net/net_ns
in your bash shell and you'll get a brand new network namespace.
There you can, for example, do
   ip link set lo up
   ip addr list
   ip addr add 1.2.3.4 dev lo
   ping -n 1.2.3.4

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]



NACK, new /proc interfaces are not acceptable.


As you can see from the comment, this patch is just for playing
with network namespaces.

Kirill



Re: [RFC] network namespaces

2006-08-17 Thread Kirill Korotaev

Basically there are currently 3 approaches that have been proposed.

The trivial bsdjail style as implemented by Serge and in a slightly
more sophisticated version in vserver.  This approach, as it does not
touch the packets, has little to no packet-level overhead.  Basically
this is what I have called the Level 3 approach.

The more in depth approach where we modify the packet processing based
upon which network interface the packet comes in on, and it looks like
each namespace has its own instance of the network stack. Roughly
what was proposed earlier in this thread the Level 2 approach.  This
potentially has per packet overhead so we need to watch the implementation
very carefully.

Some weird hybrid as proposed by Daniel, whose semantics I was never
clear on.

The good thing is that these approaches do not contradict each other.
We discussed it with Daniel during the summit and as Andrey proposed
some shortcuts can be created to avoid double stack traversing.


From the previous conversations my impression was that as long as
we could get a Layer 2 approach that did not slow down the networking
stack and was clean, everyone would be happy.

agree.


I'm buried in the process id namespace at the moment, and except
to be so for the rest of the month, so I'm not going to be
very helpful except for a few stray comments.

I will be very much obliged if you find some time to review these new
patches so that we could make some progress here.

Thanks,
Kirill


Re: [PATCH] limit rt cache size

2006-08-09 Thread Kirill Korotaev

1) dynamic table growth is the only reasonable way to
  handle this and not waste memory in all cases




Definitely that's the ideal way to go.

But there's a lot of state to update (more or less
atomically, too) in the TCP hashes. Looks tricky to
do that without hurting performance, especially since
you'll probably want to resize the tables when you've
discovered they're full and busy

and the memory is fragmented too! :/

Kirill



Re: [PATCH] limit rt cache size

2006-08-09 Thread Kirill Korotaev

2) for cases where we haven't implemented dynamic
  table growth, specifying a proper limit argument
  to the hash table allocation is a sufficient
  solution for the time being



Agreed, just we don't know what the proper limits are. 


I guess it would need someone running quite a lot of benchmarks.
Anyone volunteering? @)

In my original post I noted how easy it is to consume
the whole 1Gb of RAM on an i686 PC (and that's only 4,194,304 entries);
it looks like with more IP addresses it is not that hard to
consume much more memory.


Or do some cheesy default and document the options to change
it clearly and wait for feedback from users on what works for
them?


Kirill


[PATCH] add_timer - mod_timer() in dst_run_gc()

2006-08-09 Thread Kirill Korotaev

Patch from Dmitry Mishin [EMAIL PROTECTED]:
Replace add_timer() by mod_timer() in dst_run_gc
in order to avoid BUG message.

  CPU1                          CPU2
  dst_run_gc() entered          dst_run_gc() entered
  spin_lock(&dst_lock)          ...
  del_timer(&dst_gc_timer)      fails to get the lock
                                mod_timer() <--- puts the timer back
                                                 to the list
  add_timer(&dst_gc_timer) <--- BUG because the timer is in the list already.

Found during OpenVZ internal testing.

At first we thought that it was OpenVZ-specific, as we
added a dst_run_gc(0) call in dst_dev_event(),
but as Alexey pointed out to me, it is possible to trigger
this condition in the mainstream kernel.

F.e. the timer has fired on CPU2, but the handler was preempted
by an irq before dst_lock is tried.
Meanwhile, someone on CPU1 adds an entry to the gc list and
starts the timer.
If CPU2 was preempted long enough, this timer can expire
simultaneously with resuming the timer handler on CPU1, arriving
exactly at the situation described.

Signed-Off-By: Dmitry Mishin [EMAIL PROTECTED]
Signed-Off-By: Kirill Korotaev [EMAIL PROTECTED]
Signed-Off-By: Alexey Kuznetsov [EMAIL PROTECTED]


--- ./net/core/dst.c.dst2006-05-19 13:12:34.0 +0400
+++ ./net/core/dst.c2006-05-22 14:29:50.0 +0400
@@ -95,12 +95,11 @@ static void dst_run_gc(unsigned long dum
dst_gc_timer_inc = DST_GC_INC;
dst_gc_timer_expires = DST_GC_MIN;
}
-   dst_gc_timer.expires = jiffies + dst_gc_timer_expires;
#if RT_CACHE_DEBUG >= 2
printk("dst_total: %d/%d %ld\n",
   atomic_read(&dst_total), delayed,  dst_gc_timer_expires);
#endif
-   add_timer(&dst_gc_timer);
+   mod_timer(&dst_gc_timer, jiffies + dst_gc_timer_expires);

out:
spin_unlock(&dst_lock);
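
Why mod_timer() is the right tool here, paraphrasing the kernel/timer.c
documentation (not part of the patch):

/* mod_timer(timer, expires) is the atomic equivalent of:
 *
 *	del_timer(timer);
 *	timer->expires = expires;
 *	add_timer(timer);
 *
 * It is therefore safe against a timer that is already pending, while
 * add_timer() must never see a pending timer - exactly the BUG the race
 * above triggers. */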



Re: [PATCH] limit rt cache size

2006-08-08 Thread Kirill Korotaev

David Miller wrote:

we quickly discover this GIT commit:

424c4b70cc4ff3930ee36a2ef7b204e4d704fd26

[IPV4]: Use the fancy alloc_large_system_hash() function for route hash table

- rt hash table allocated using alloc_large_system_hash() function.

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]
Signed-off-by: David S. Miller [EMAIL PROTECTED]

And it is clear that old code used num_physpages, which counts low
memory only.  This shows clearly that Eric's usage of the HASH_HIGHMEM
flag here is erroneous.  So we should remove it.

at least for i686 num_physpages includes highmem, so IMHO this bug was there 
for years:

./arch/i386/mm/init.c:
static void __init set_max_mapnr_init(void)
{
#ifdef CONFIG_HIGHMEM
   num_physpages = highend_pfn;
#else
   num_physpages = max_low_pfn;
#endif
}


Look!  This thing even uses num_physpages in current code to compute
the scale argument to alloc_large_system_hash() :)))

the same bug here? :) the good thing is that it only selects scale 15 or 17;
there is no other possible choice here :)))


As for the routing cache size, it looks like it is another bug.
route.c should not force rt_max_size = 16*rt_hash_size.
I think it should consult available memory and limit rt_max_size
to some reasonable value, even if the hash size is too high.



Sure.  This current setting of 16*rt_hash_size is meant to
try to limit hash chain lengths I guess.  2.4.x does the same
thing.  Note also that by basing it upon the number of routing cache
hash chains, it is effectively consulting available memory.
This is why, when the hash table sizing is crap, so is the rt_max_size
calculation.  Fix one and you fix them both :)

imho a chain length limitation of 16 is not that bad, but to avoid such
features it probably should be fixed :)


Once the HASH_HIGHMEM flag is removed, assuming system has  128K of
memory, what we get is:

hash_chains = lowmem / 128K
rt_max_size = ((lowmem / 128K) * 16) == lowmem / 8K

So we allow one routing cache entry for each 8K of lowmem we have :)
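
A quick numeric check of the arithmetic above (an illustrative standalone
program, not from the thread), for a box with 1 GB of lowmem:

#include <stdio.h>

int main(void)
{
	unsigned long lowmem = 1UL << 30;		/* 1 GB of lowmem */
	unsigned long chains = lowmem / (128 << 10);	/* one chain per 128K */
	unsigned long rt_max = chains * 16;		/* == lowmem / 8K */

	/* prints: chains=8192 rt_max_size=131072 */
	printf("chains=%lu rt_max_size=%lu\n", chains, rt_max);
	return 0;
}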

So for now it is probably sufficient to just get rid of the
HASH_HIGHMEM flag here.  Later we can try changing this multiplier
of 16 to something like 8 or even 4.

should we remove it for TCP hashes?

Thanks,
Kirill


Re: [PATCH] limit rt cache size

2006-08-08 Thread Kirill Korotaev

David Miller wrote:

From: Andi Kleen [EMAIL PROTECTED]
Date: Tue, 8 Aug 2006 08:53:03 +0200



That's still too big. Consider a 2TB machine, with all memory in LOWMEM.



Andi I agree with you, route.c should pass in a suitable limit.
I'm just suggesting a fix for a seperate problem.


So summing up, we have the following issues imho:
1a) rt hash size should be calculated based on lowmem size, not total size
1b) rt hash size should have some upper limit (requested by Andi Kleen for
    huge mem systems)
2a) max number of rt hash entries should be calculated based on low memory,
    not as rt_hash_chains*16.
2b) when CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGE_ALLOC are ON, 2a) should be
    corrected taking into account the _real_ rtable entry size (one page
    instead of 256b!!!).
3) should we limit the TCP ehash and bhash sizes the same way?

If I haven't missed something I will prepare a patch for 1-2) and
a separate patch for 3).

Thanks,
Kirill


Re: [patch 3/4] Network namespaces: IPv4 FIB/routing in namespaces

2006-06-28 Thread Kirill Korotaev

Structures related to IPv4 routing (FIB and routing cache)
are made per-namespace.


Hi Andrey,

if the resources are private to the namespace, how will you handle
NFS mounted before creating the network namespace ? Do you take care of
that or simply assume you can't access NFS anymore ?



This is a question that brings up another level of interaction between
networking and the rest of kernel code.
Solution that I use now makes the NFS communication part always run in
the root namespace.  This is discussable, of course, but it's a far more
complicated matter than just device lists or routing :)
if we had containers (not namespaces) then it would also be possible to
run NFS in the context of the appropriate container, and thus each user
could mount NFS itself with the correct networking context.


it's another thing which ties subsystems together and makes namespaces ugly :/

Kirill


Re: [RFC][patch 1/4] Network namespaces: cleanup of dev_base list use

2006-06-27 Thread Kirill Korotaev

Cleanup of dev_base list use, with the aim to make device list per-namespace.
In almost every occasion, use of the dev_base variable and the dev->next pointer
could be easily replaced by for_each_netdev loop.
A few most complicated places were converted to using
first_netdev()/next_netdev().



As a proof of concept patch this is ok.

As a real world patch this is much too big, which prevents review.
Plus it takes a few actions that are more than replace just
iterators through the device list.
Mmm, actually it is a whole changeset and should go as one patch. I
didn't find it to be big, and my review took only 5-10 mins.

I also don't think that mailing each driver maintainer is a good idea.
Only if we want to make some buzz :)

Kirill


Re: [RFC][patch 1/4] Network namespaces: cleanup of dev_base list use

2006-06-27 Thread Kirill Korotaev

Cleanup of dev_base list use, with the aim to make device list per-namespace.
In almost every occasion, use of the dev_base variable and the dev->next pointer
could be easily replaced by for_each_netdev loop.
A few most complicated places were converted to using
first_netdev()/next_netdev().


As a proof of concept patch this is ok.
As a real world patch this is much too big, which prevents review.
Plus it takes a few actions that are more than replace just
iterators through the device list.


Mmm, actually it is a whole changeset and should go as one patch. I didn't
find it to be big, and my review took only 5-10 mins.
I also don't think that mailing each driver maintainer is a good idea.
Only if we want to make some buzz :)



Thanks for supporting my case.  You reviewed it and missed the obvious typo.
I do agree that a patchset doing it all should happen at once.
This doesn't support anything. e.g. I caught quite a lot of bugs after 
Ingo Molnar, but this doesn't make his code poor. People are people.

Anyway, I would be happy to see the typo.


As for not mailing the maintainers of the code we are changing.  That
would just be irresponsible.


Kirill