Re: [PATCH 2.6.25] netns: struct net content re-work
Daniel Lezcano wrote:
> Denis V. Lunev wrote:
>> Recently David Miller and Herbert Xu pointed out that struct net is
>> becoming bloated and unmaintainable. There are two ways out:
>> - provide a pointer to a network subsystem definition from struct net.
>>   This costs an additional dereference.
>> - place the subsystem definition into the structure itself. This
>>   speeds up run-time access at the cost of recompilation time.
>> The second approach looks better to us. Yes, we do not need/want a
>> pointer in this structure, nor more dereferences in the network code.
>> Other subsystems will be converted to this approach if this is
>> accepted :)
>>
>> Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
>> ---
>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
>> index b62e31f..f60e1ce 100644
>> --- a/include/net/net_namespace.h
>> +++ b/include/net/net_namespace.h
>> @@ -8,6 +8,8 @@
>>  #include <linux/workqueue.h>
>>  #include <linux/list.h>
>>
>> +#include <net/netns/unix.h>
>> +
>>  struct proc_dir_entry;
>>  struct net_device;
>>  struct sock;
>> @@ -46,8 +48,7 @@ struct net {
>>  	struct hlist_head	packet_sklist;
>>
>>  	/* unix sockets */
>> -	int			sysctl_unix_max_dgram_qlen;
>> -	struct ctl_table_header	*unix_ctl;
>> +	struct netns_unix	unx;
>
> Can you change this from unx to unix?

no, it won't compile. Guess why :)

> If you encapsulate the structure definitions per subsystem, you can
> drop the unix prefix in the variable declaration. Instead of having:
>     netns->unix.unix_ctl
> you will have:
>     netns->unix.ctl

agree.

Kirill
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][net-2.6.25][NETNS] make netns cleanup to run in a separate workqueue
Ah, sorry. Didn't notice it's called only on boot.

Acked-By: Kirill Korotaev <[EMAIL PROTECTED]>

Kirill Korotaev wrote:
> imho panic() is too much. create_singlethread_workqueue() can fail
> e.g. due to out of memory...
>
> Thanks,
> Kirill
>
> Daniel Lezcano wrote:
>> Subject: make netns cleanup run in a separate workqueue
>> From: Benjamin Thery <[EMAIL PROTECTED]>
>>
>> This patch adds a separate workqueue for cleaning up a network
>> namespace. If we use the keventd workqueue to execute cleanup_net(),
>> there is a problem unregistering devices in IPv6. Indeed, the cleanup
>> code also schedules work on keventd: as long as cleanup_net() hasn't
>> returned, dst_gc_task() cannot run, and as long as dst_gc_task() has
>> not run, there are still references pending on the net devices, so
>> cleanup_net() cannot unregister them and exit the keventd workqueue.
>>
>> Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>
>> Signed-off-by: Benjamin Thery <[EMAIL PROTECTED]>
>> Acked-by: Denis V. Lunev <[EMAIL PROTECTED]>
>> ---
>>  net/core/net_namespace.c | 9 +++++++--
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> Index: net-2.6.25/net/core/net_namespace.c
>> ===
>> --- net-2.6.25.orig/net/core/net_namespace.c
>> +++ net-2.6.25/net/core/net_namespace.c
>> @@ -58,6 +58,7 @@ out_undo:
>>  #ifdef CONFIG_NET_NS
>>  static struct kmem_cache *net_cachep;
>> +static struct workqueue_struct *netns_wq;
>>
>>  static struct net *net_alloc(void)
>>  {
>> @@ -149,7 +150,7 @@ void __put_net(struct net *net)
>>  {
>>  	/* Cleanup the network namespace in process context */
>>  	INIT_WORK(&net->work, cleanup_net);
>> -	schedule_work(&net->work);
>> +	queue_work(netns_wq, &net->work);
>>  }
>>  EXPORT_SYMBOL_GPL(__put_net);
>>
>> @@ -171,7 +172,13 @@ static int __init net_ns_init(void)
>>  	net_cachep = kmem_cache_create("net_namespace", sizeof(struct net),
>>  					SMP_CACHE_BYTES, SLAB_PANIC, NULL);
>> +
>> +	/* Create workqueue for cleanup */
>> +	netns_wq = create_singlethread_workqueue("netns");
>> +	if (!netns_wq)
>> +		panic("Could not create netns workq");
>>  #endif
>> +
>>  	mutex_lock(&net_mutex);
>>  	err = setup_net(&init_net);

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [Devel] Re: [RFD] L2 Network namespace infrastructure
Ben Greear wrote:
> Kirill Korotaev wrote:
>> Patrick McHardy wrote:
>>> I believe OpenVZ stores the current namespace somewhere global,
>>> which avoids passing the namespace around. Couldn't you do this as
>>> well?
>>
>> yes, we store a global namespace context on current (it can be stored
>> in a per-cpu variable as well). do you prefer this way?
>
> For what it's worth, I don't prefer this way, as I can see wanting to
> have one application use several namespaces at once.

As I wrote to you in another email, that's not a problem, since
applications themselves do not own a network namespace. It is objects
like sockets which are bound to a net namespace, and thus you can have
sockets from different namespaces in one application regardless of
which context-handling mechanism we are talking about.

Thanks,
Kirill
Re: [RFD] L2 Network namespace infrastructure
Eric W. Biederman wrote:
> Patrick McHardy <[EMAIL PROTECTED]> writes:
>> Eric W. Biederman wrote:
>>> -- The basic design
>>>
>>> There will be a network namespace structure that holds the global
>>> variables for a network namespace, making those global variables per
>>> network namespace.
>>>
>>> One of those per-network-namespace global variables will be the
>>> loopback device. Which means the network namespace a packet resides
>>> in can be found simply by examining the network device or the socket
>>> the packet is traversing.
>>>
>>> Either a pointer to this global structure will be passed into the
>>> functions that need to reference per-network-namespace variables, or
>>> a structure that is already passed in (such as the network device)
>>> will be modified to contain a pointer to the network namespace
>>> structure.
>>
>> I believe OpenVZ stores the current namespace somewhere global, which
>> avoids passing the namespace around. Couldn't you do this as well?
>
> It sucks. Especially in the corner cases. Think macvlan with the real
> network device in one namespace and the ``vlan'' device in another
> namespace.

sorry, what's the problem here? I don't see a single problem related to
the global current context.

> The implementation of a global is also a little questionable. Last I
> looked it didn't work on the transmit path at all and was interesting
> on the receive path.
>
> Further, and fundamentally, all a global achieves is removing the need
> for the noise patches where you pass the pointer into the various
> functions. For long term maintenance it doesn't help anything.

this is not 100% true. Having a global context also helps:
1. to account the time spent in network processing or some other
   activity (like IRQ handling) to the appropriate namespace.
2. it is really useful and helpful for functions like printk() with a
   virtualized syslog buffer. Otherwise you would need to find a
   context in every place where a namespace-private message should be
   printed.

Actually it is as natural as 'current', which is global, but could
instead be passed from entry.S to every function which needs it. OVZ
experience just tells me that this helps to avoid all this noise and is
no harder to work with than 'current'. But it's up to you.

Thanks,
Kirill
Re: [Devel] Re: [RFD] L2 Network namespace infrastructure
Patrick McHardy wrote:
> Eric W. Biederman wrote:
>> -- The basic design
>>
>> There will be a network namespace structure that holds the global
>> variables for a network namespace, making those global variables per
>> network namespace.
>>
>> One of those per-network-namespace global variables will be the
>> loopback device. Which means the network namespace a packet resides
>> in can be found simply by examining the network device or the socket
>> the packet is traversing.
>>
>> Either a pointer to this global structure will be passed into the
>> functions that need to reference per-network-namespace variables, or
>> a structure that is already passed in (such as the network device)
>> will be modified to contain a pointer to the network namespace
>> structure.
>
> I believe OpenVZ stores the current namespace somewhere global, which
> avoids passing the namespace around. Couldn't you do this as well?

yes, we store a global namespace context on current (it can be stored
in a per-cpu variable as well). do you prefer this way?

Thanks,
Kirill
Re: [Devel] Re: [RFD] L2 Network namespace infrastructure
Jeff Garzik wrote:
> Eric W. Biederman wrote:
>> Jeff Garzik <[EMAIL PROTECTED]> writes:
>>> David Miller wrote:
>>>> I don't accept that we have to add another function argument to a
>>>> bunch of core routines just to support this crap, especially since
>>>> you give no way to turn it off and get that function argument slot
>>>> back.
>>>>
>>>> To be honest I think this form of virtualization is a complete
>>>> waste of time, even the openvz approach. We're protecting the
>>>> kernel from itself, and that's an endless uphill battle that you
>>>> will never win. Let's do this kind of stuff properly with a real
>>>> minimal hypervisor, hopefully with appropriate hardware level
>>>> support and good virtualized device interfaces, instead of this
>>>> namespace stuff.
>>>
>>> Strongly seconded. This containerized virtualization approach just
>>> bloats up the kernel for something that is inherently fragile and
>>> IMO less secure -- protecting the kernel from itself.
>>>
>>> Plenty of other virt approaches don't stir the code like this, while
>>> simultaneously providing fewer, more-clean entry points for the
>>> virtualization to occur.
>>
>> Wrong. I really don't want to get into a "my virtualization approach
>> is better than yours" argument. But this is flat out wrong. 99% of
>> the changes I'm talking about introducing are just:
>> -	variable
>> +	ptr->variable
>> There are more pieces, mostly in how we initialize those variables,
>> but that is the essence of the change.
>
> You completely dodged the main objection. Which is OK if you are
> selling something to marketing departments, but not OK here.

It is a pure illusion that one kind of virtualization is better than
another. Look, maybe *hardware* virtualization and the small size of
the hypervisor make you feel safer, but you totally forget about all
the emulation drivers etc., which can have bugs and security
implications as well, maybe triggerable from inside a VM.

> Containers introduce chroot-jail-like features that give one a false
> sense of security, while still requiring one to poke holes in the
> illusion to get hardware-specific tasks accomplished.

The concept of users in any OS (and in Linux in particular) gives
people a false sense of security as well! I know a bunch of examples of
how one user can crash/DoS/abuse another one, e.g. on RHEL5 or
mainstream. But no one has a problem with that, and no one thinks that
multiuserness is a step backward, or maybe someone does?! Yes, there
are always bugs and sometimes security implications, but people live
fine with it once it is fixed.

> The capable/not-capable model (i.e. superuser / normal user) is
> _still_ being secured locally, even after decades of work and
> whitepapers and audits. You are drinking Deep Kool-Aid if you think
> adding containers to the myriad kernel subsystems does anything
> besides increasing fragility, and decreasing security. You are
> securing in-kernel subsystems against other in-kernel subsystems.
> The superuser/user model made that difficult enough... now containers
> add exponential audit complexity to that. Who is to say that a local
> root does not also pierce the container model?

containers do only one thing: they make sure that objects from one
context are not visible in another one. If containers are not used,
everything behaves as it does now, i.e. everything is global and
globally visible, and there is no auditing overhead at all. So if you
are not interested in containers, code auditing won't be noticeably
harder for you.

>> And as opposed to other virtualization approaches, so far no one has
>> been able to measure the overhead. I suspect there will be a few more
>> cache line misses somewhere, but they haven't shown up yet.
>>
>> If the only use was strong isolation, which Dave complains about, I
>> would concur that the namespace approach is inappropriate. However
>> there are a lot of other uses.
>
> Sure there are uses. There are uses to putting the X server into the
> kernel, too. At some point complexity and featuritis has to take a
> back seat to basic sanity.

I agree about sanity, however I totally disagree about the complexity
you talk about. The bugs we face/fix in the Linux kernel which are
found with the help of this kind of virtualization make me believe that
the Linux kernel only wins from it.

Thanks,
Kirill
Re: [Devel] Re: [PATCH] Virtual ethernet tunnel
David Miller wrote:
> From: Pavel Emelianov <[EMAIL PROTECTED]>
> Date: Wed, 06 Jun 2007 19:11:38 +0400
>
>> Veth stands for Virtual ETHernet. It is a simple tunnel driver that
>> works at the link layer and looks like a pair of ethernet devices
>> interconnected with each other.
>
> I would suggest choosing a different name. 'veth' is also the name of
> the virtualized ethernet device found on IBM machines, driven by
> drivers/net/ibmveth.[ch]

AFAICS, ibmveth.c registers ethX devices, while this driver registers
vethX by default, so there is not much of a conflict IMHO. But if you
still don't like it, I would suggest 'ethtun', and later we will come
up with 'iptun' for IP tunneling. Is that ok for you?

Thanks,
Kirill
Re: [PATCH] Virtual ethernet tunnel
Daniel,

Daniel Lezcano wrote:
> Pavel Emelianov wrote:
>> I did this in the very first version, but Alexey showed me that this
>> would be wrong. Look. When we create the second device it must be in
>> the other namespace, as it is useless to have them in one namespace.
>> But if we have the device in the other namespace, the RTNL_NEWLINK
>> message from the kernel would come into this namespace, thus
>> confusing the ip utility in the init namespace. Creating the device
>> in the init ns and moving it into the new one is rather a complex
>> task.
>
> Pavel, moving the netdevice to another namespace is not a complex
> task. Eric Biederman did it in his patchset (cf.
> http://lxc.sf.net/network )

By saying "complex" I didn't mean that this is difficult to implement,
but that it consists (must consist) of many stages. I.e. it is
composite. Creating the device right in the target namespace is
lighter.

> When the pair device is created, both extremities are in the init
> namespace and you can choose to which namespace to move one extremity.

I do not mind that.

> When the network namespace dies, the netdev is moved back to the init
> namespace. That facilitates network device management. Concerning
> netlink events, these are automatically generated when the network
> device is moved between namespaces. IMHO, we should have network
> device movement between namespaces in order to be able to move a
> physical network device too (e.g. you have 4 NICs and you want to
> create 3 containers and assign 3 of the NICs to them).

Agree. Moving devices is a must-have functionality. I do not mind
making the pair in the init namespace and moving the second one into
the desired namespace. But if we will *always* have the two ends in
different namespaces, why complicate things?

> Just to provide a netdev sufficiently generic to be used by people who
> don't want namespaces but just want to do some network testing, like
> Ben Greear does. He mentioned in a previous email that he will be
> happy to stop redirecting people to an out-of-tree patch:
> https://lists.linux-foundation.org/pipermail/containers/2007-April/004420.html

no one is against generic code and the ability to create the two
interfaces in *one* namespace. (We currently allow doing so in OpenVZ.)

However, believe me, moving an interface is a *hard* operation. Much
harder than netdev registration from scratch, because it requires
taking into account many things like:
- packets in flight, which require synchronization, and that is slow on
  big machines
- asynchronous sysfs entry registration/deregistration from
  rtnl_unlock()
- netdev_run_todo()
- name/ifindex collisions
- shutdown/cleanup of addresses/routes/qdiscs and other similar stuff

Thanks,
Kirill
Re: [PATCH 14/59] sysctl: C99 convert xfs ctl_tables
minor extra space in the table below...

Kirill

From: Eric W. Biederman <[EMAIL PROTECTED]> - unquoted

Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_sysctl.c | 258 ++++++++++++++++++++++++----------
 1 file changed, 180 insertions(+), 78 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sysctl.c b/fs/xfs/linux-2.6/xfs_sysctl.c
index af777e9..5a0eefc 100644
--- a/fs/xfs/linux-2.6/xfs_sysctl.c
+++ b/fs/xfs/linux-2.6/xfs_sysctl.c
@@ -55,95 +55,197 @@ xfs_stats_clear_proc_handler(
 #endif /* CONFIG_PROC_FS */

 STATIC ctl_table xfs_table[] = {
-	{XFS_RESTRICT_CHOWN, "restrict_chown", &xfs_params.restrict_chown.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.restrict_chown.min, &xfs_params.restrict_chown.max},
-
-	{XFS_SGID_INHERIT, "irix_sgid_inherit", &xfs_params.sgid_inherit.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.sgid_inherit.min, &xfs_params.sgid_inherit.max},
-
-	{XFS_SYMLINK_MODE, "irix_symlink_mode", &xfs_params.symlink_mode.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.symlink_mode.min, &xfs_params.symlink_mode.max},
-
-	{XFS_PANIC_MASK, "panic_mask", &xfs_params.panic_mask.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.panic_mask.min, &xfs_params.panic_mask.max},
-
-	{XFS_ERRLEVEL, "error_level", &xfs_params.error_level.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.error_level.min, &xfs_params.error_level.max},
-
-	{XFS_SYNCD_TIMER, "xfssyncd_centisecs", &xfs_params.syncd_timer.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.syncd_timer.min, &xfs_params.syncd_timer.max},
-
-	{XFS_INHERIT_SYNC, "inherit_sync", &xfs_params.inherit_sync.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.inherit_sync.min, &xfs_params.inherit_sync.max},
-
-	{XFS_INHERIT_NODUMP, "inherit_nodump", &xfs_params.inherit_nodump.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.inherit_nodump.min, &xfs_params.inherit_nodump.max},
-
-	{XFS_INHERIT_NOATIME, "inherit_noatime", &xfs_params.inherit_noatim.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.inherit_noatim.min, &xfs_params.inherit_noatim.max},
-
-	{XFS_BUF_TIMER, "xfsbufd_centisecs", &xfs_params.xfs_buf_timer.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.xfs_buf_timer.min, &xfs_params.xfs_buf_timer.max},
-
-	{XFS_BUF_AGE, "age_buffer_centisecs", &xfs_params.xfs_buf_age.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.xfs_buf_age.min, &xfs_params.xfs_buf_age.max},
-
-	{XFS_INHERIT_NOSYM, "inherit_nosymlinks", &xfs_params.inherit_nosym.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.inherit_nosym.min, &xfs_params.inherit_nosym.max},
-
-	{XFS_ROTORSTEP, "rotorstep", &xfs_params.rotorstep.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.rotorstep.min, &xfs_params.rotorstep.max},
-
-	{XFS_INHERIT_NODFRG, "inherit_nodefrag", &xfs_params.inherit_nodfrg.val,
-	sizeof(int), 0644, NULL, &proc_dointvec_minmax,
-	&sysctl_intvec, NULL,
-	&xfs_params.inherit_nodfrg.min, &xfs_params.inherit_nodfrg.max},
+	{
+		.ctl_name	= XFS_RESTRICT_CHOWN,
+		.procname	= "restrict_chown",
+		.data		= &xfs_params.restrict_chown.val,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &xfs_params.restrict_chown.min,
+		.extra2		= &xfs_params.restrict_chown.max
+	},
+	{
+		.ctl_name	= XFS_SGID_INHERIT,
+		.procname	= "irix_sgid_inherit",
+		.data		= &xfs_params.sgid_inherit.val,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &xfs_params.sgid_inherit.min,
+		.extra2		= &xfs_params.sgid_inherit.max
+	},
+	{
+		.ctl_name	= XFS_SYMLINK_MODE,
+		.procname	= "irix_symlink_mode",
+		.data		= &xfs_params.symlink_mode.val,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	=
Re: [PATCH 25/59] sysctl: C99 convert arch/frv/kernel/pm.c
another small minor note.

From: Eric W. Biederman <[EMAIL PROTECTED]> - unquoted

Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
---
 arch/frv/kernel/pm.c | 50 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 43 insertions(+), 7 deletions(-)

diff --git a/arch/frv/kernel/pm.c b/arch/frv/kernel/pm.c
index c1840d6..aa50333 100644
--- a/arch/frv/kernel/pm.c
+++ b/arch/frv/kernel/pm.c
@@ -401,17 +401,53 @@ static int cm_sysctl(ctl_table *table, int __user *name, int nlen,

 static struct ctl_table pm_table[] =
 {
-	{CTL_PM_SUSPEND, "suspend", NULL, 0, 0200, NULL, &sysctl_pm_do_suspend},
-	{CTL_PM_CMODE, "cmode", &clock_cmode_current, sizeof(int), 0644, NULL, &cmode_procctl, &cmode_sysctl, NULL},
-	{CTL_PM_P0, "p0", &clock_p0_current, sizeof(int), 0644, NULL, &p0_procctl, &p0_sysctl, NULL},
-	{CTL_PM_CM, "cm", &clock_cm_current, sizeof(int), 0644, NULL, &cm_procctl, &cm_sysctl, NULL},
-	{0}
+	{
+		.ctl_name	= CTL_PM_SUSPEND,
+		.procname	= "suspend",
+		.data		= NULL,
+		.maxlen		= 0,
+		.mode		= 0200,
+		.proc_handler	= &sysctl_pm_do_suspend,
+	},
+	{
+		.ctl_name	= CTL_PM_CMODE,
+		.procname	= "cmode",
+		.data		= &clock_cmode_current,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &cmode_procctl,
+		.strategy	= &cmode_sysctl,
+	},
+	{
+		.ctl_name	= CTL_PM_P0,
+		.procname	= "p0",
+		.data		= &clock_p0_current,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &p0_procctl,
+		.strategy	= &p0_sysctl,
+	},
+	{
+		.ctl_name	= CTL_PM_CM,
+		.procname	= "cm",
+		.data		= &clock_cm_current,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &cm_procctl,
+		.strategy	= &cm_sysctl,
+	},
+	{ .ctl_name = 0 }

in the next patch (26/59) you write just { }; the explicit .ctl_name = 0
is not required here either.

 };

 static struct ctl_table pm_dir_table[] =
 {
-	{CTL_PM, "pm", NULL, 0, 0555, pm_table},
-	{0}
+	{
+		.ctl_name	= CTL_PM,
+		.procname	= "pm",
+		.mode		= 0555,
+		.child		= pm_table,
+	},
+	{ .ctl_name = 0 }
 };

 /*
Re: [PATCH 33/59] sysctl: s390 move sysctl definitions to sysctl.h
IDs are not sorted in the enum. see below.

From: Eric W. Biederman <[EMAIL PROTECTED]> - unquoted

We need to have the definition of all top level sysctl directories
registered in sysctl.h so we don't conflict by accident and cause ABI
problems.

Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
---
 arch/s390/appldata/appldata.h | 3 +--
 arch/s390/kernel/debug.c      | 1 -
 arch/s390/mm/cmm.c            | 4 ----
 include/linux/sysctl.h        | 7 +++++++
 4 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/s390/appldata/appldata.h b/arch/s390/appldata/appldata.h
index 0429481..4069b81 100644
--- a/arch/s390/appldata/appldata.h
+++ b/arch/s390/appldata/appldata.h
@@ -21,8 +21,7 @@
 #define APPLDATA_RECORD_NET_SUM_ID	0x03	/* must be < 256 ! */
 #define APPLDATA_RECORD_PROC_ID	0x04

-#define CTL_APPLDATA	2120	/* sysctl IDs, must be unique */
-#define CTL_APPLDATA_TIMER	2121
+#define CTL_APPLDATA_TIMER	2121	/* sysctl IDs, must be unique */
 #define CTL_APPLDATA_INTERVAL	2122
 #define CTL_APPLDATA_MEM	2123
 #define CTL_APPLDATA_OS	2124
diff --git a/arch/s390/kernel/debug.c b/arch/s390/kernel/debug.c
index bb57bc0..c81f8e5 100644
--- a/arch/s390/kernel/debug.c
+++ b/arch/s390/kernel/debug.c
@@ -852,7 +852,6 @@ debug_finish_entry(debug_info_t * id, debug_entry_t* active, int level,
 static int debug_stoppable=1;
 static int debug_active=1;

-#define CTL_S390DBF 5677
 #define CTL_S390DBF_STOPPABLE 5678
 #define CTL_S390DBF_ACTIVE 5679

diff --git a/arch/s390/mm/cmm.c b/arch/s390/mm/cmm.c
index 607f50e..df733d5 100644
--- a/arch/s390/mm/cmm.c
+++ b/arch/s390/mm/cmm.c
@@ -256,10 +256,6 @@ cmm_skip_blanks(char *cp, char **endp)
 }

 #ifdef CONFIG_CMM_PROC
-/* These will someday get removed. */
-#define VM_CMM_PAGES		1111
-#define VM_CMM_TIMED_PAGES	1112
-#define VM_CMM_TIMEOUT		1113

 static struct ctl_table cmm_table[];

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 71c16b4..56d0161 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -73,6 +73,8 @@ enum
 	CTL_SUNRPC=7249,	/* sunrpc debug */
 	CTL_PM=9899,	/* frv power management */
 	CTL_FRV=9898,	/* frv specific sysctls */
+	CTL_S390DBF=5677,	/* s390 debug */
+	CTL_APPLDATA=2120,	/* s390 appldata */

not sorted by ID? imho these should be sorted, otherwise a duplicate
can go unnoticed when a new entry is inserted above.

 };

 /* CTL_BUS names: */
@@ -205,6 +207,11 @@ enum
 	VM_PANIC_ON_OOM=33,	/* panic at out-of-memory */
 	VM_VDSO_ENABLED=34,	/* map VDSO into new processes? */
 	VM_MIN_SLAB=35,		/* Percent pages ignored by zone reclaim */
+
+	/* s390 vm cmm sysctls */
+	VM_CMM_PAGES=1111,
+	VM_CMM_TIMED_PAGES=1112,
+	VM_CMM_TIMEOUT=1113,
 };
Re: [PATCH 50/59] sysctl: Move utsname sysctls to their own file
Eric, though I personally don't care much:
1. I'd ask you not to put your authorship/copyright on code which you
   just copied from other places. It just doesn't look polite IMHO.
2. I would propose not introducing utsname_sysctl.c: both files are so
   small and minor that I can't see much reason to split them.

Kirill

From: Eric W. Biederman <[EMAIL PROTECTED]> - unquoted

This is just a simple cleanup to keep kernel/sysctl.c from getting too
crowded with special cases, and by keeping all of the utsname logic
together it makes the code a little more readable.

Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
---
 kernel/Makefile         |   1 +
 kernel/sysctl.c         | 115 ----------------------
 kernel/utsname_sysctl.c | 146 ++++++++++++++++++++++++
 3 files changed, 147 insertions(+), 115 deletions(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index 14f4d45..d286c44 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_UTS_NS) += utsname.o
+obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7420761..a8c0a03 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -135,13 +135,6 @@ static int parse_table(int __user *, int, void __user *, size_t __user *,
 		void __user *, size_t, ctl_table *);
 #endif

-static int proc_do_uts_string(ctl_table *table, int write, struct file *filp,
-		void __user *buffer, size_t *lenp, loff_t *ppos);
-
-static int sysctl_uts_string(ctl_table *table, int __user *name, int nlen,
-		void __user *oldval, size_t __user *oldlenp,
-		void __user *newval, size_t newlen);
-
 #ifdef CONFIG_SYSVIPC
 static int sysctl_ipc_data(ctl_table *table, int __user *name, int nlen,
 		void __user *oldval, size_t __user *oldlenp,
@@ -174,27 +167,6 @@ extern ctl_table inotify_table[];
 int sysctl_legacy_va_layout;
 #endif

-static void *get_uts(ctl_table *table, int write)
-{
-	char *which = table->data;
-#ifdef CONFIG_UTS_NS
-	struct uts_namespace *uts_ns = current->nsproxy->uts_ns;
-	which = (which - (char *)&init_uts_ns) + (char *)uts_ns;
-#endif
-	if (!write)
-		down_read(&uts_sem);
-	else
-		down_write(&uts_sem);
-	return which;
-}
-
-static void put_uts(ctl_table *table, int write, void *which)
-{
-	if (!write)
-		up_read(&uts_sem);
-	else
-		up_write(&uts_sem);
-}

 #ifdef CONFIG_SYSVIPC
 static void *get_ipc(ctl_table *table, int write)
@@ -275,51 +247,6 @@ static ctl_table root_table[] = {

 static ctl_table kern_table[] = {
 	{
-		.ctl_name	= KERN_OSTYPE,
-		.procname	= "ostype",
-		.data		= init_uts_ns.name.sysname,
-		.maxlen		= sizeof(init_uts_ns.name.sysname),
-		.mode		= 0444,
-		.proc_handler	= &proc_do_uts_string,
-		.strategy	= &sysctl_uts_string,
-	},
-	{
-		.ctl_name	= KERN_OSRELEASE,
-		.procname	= "osrelease",
-		.data		= init_uts_ns.name.release,
-		.maxlen		= sizeof(init_uts_ns.name.release),
-		.mode		= 0444,
-		.proc_handler	= &proc_do_uts_string,
-		.strategy	= &sysctl_uts_string,
-	},
-	{
-		.ctl_name	= KERN_VERSION,
-		.procname	= "version",
-		.data		= init_uts_ns.name.version,
-		.maxlen		= sizeof(init_uts_ns.name.version),
-		.mode		= 0444,
-		.proc_handler	= &proc_do_uts_string,
-		.strategy	= &sysctl_uts_string,
-	},
-	{
-		.ctl_name	= KERN_NODENAME,
-		.procname	= "hostname",
-		.data		= init_uts_ns.name.nodename,
-		.maxlen		= sizeof(init_uts_ns.name.nodename),
-		.mode		= 0644,
-		.proc_handler	= &proc_do_uts_string,
-		.strategy	= &sysctl_uts_string,
-	},
-	{
-		.ctl_name	= KERN_DOMAINNAME,
-		.procname	= "domainname",
-		.data		= init_uts_ns.name.domainname,
-		.maxlen		= sizeof(init_uts_ns.name.domainname),
-		.mode		= 0644,
-		.proc_handler	= &proc_do_uts_string,
-		.strategy	= &sysctl_uts_string,
-	},
-	{
 		.ctl_name	= KERN_PANIC,
 		.procname	= "panic",
 		.data		= &panic_timeout,
@@
Re: [PATCH 51/59] sysctl: Move SYSV IPC sysctls to their own file
1. I'd ask you not to put your authorship/copyright on code which you
   just copied from other places. It just doesn't look polite IMHO.
2. please don't name files like ipc/ipc_sysctl.c; ipc/sysctl.c sounds
   better IMHO.
3. any reason to introduce CONFIG_SYSVIPC_SYSCTL? why not simply do
	+obj-$(CONFIG_SYSCTL) += sysctl.o
   instead?

Kirill

From: Eric W. Biederman <[EMAIL PROTECTED]> - unquoted

This is just a simple cleanup to keep kernel/sysctl.c from getting too
crowded with special cases, and by keeping all of the ipc logic
together it makes the code a little more readable.

Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
---
 init/Kconfig     |   6 ++
 ipc/Makefile     |   1 +
 ipc/ipc_sysctl.c | 182 +++++++++++++++++++++++++++++
 kernel/sysctl.c  | 174 ----------------------------
 4 files changed, 189 insertions(+), 174 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index a3f83e2..33bc38d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -116,6 +116,12 @@ config SYSVIPC
 	  section 6.4 of the Linux Programmer's Guide, available from
 	  <http://www.tldp.org/guides.html>.

+config SYSVIPC_SYSCTL
+	bool
+	depends on SYSVIPC
+	depends on SYSCTL
+	default y
+
 config IPC_NS
 	bool "IPC Namespaces"
 	depends on SYSVIPC
diff --git a/ipc/Makefile b/ipc/Makefile
index 0a6d626..b93bba6 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -4,6 +4,7 @@

 obj-$(CONFIG_SYSVIPC_COMPAT) += compat.o
 obj-$(CONFIG_SYSVIPC) += util.o msgutil.o msg.o sem.o shm.o
+obj-$(CONFIG_SYSVIPC_SYSCTL) += ipc_sysctl.o
 obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
new file mode 100644
index 000..9018009
--- /dev/null
+++ b/ipc/ipc_sysctl.c
@@ -0,0 +1,182 @@
+/*
+ * Copyright (C) 2007
+ *
+ * Author: Eric Biederman <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ */
+
+#include <linux/module.h>
+#include <linux/ipc.h>
+#include <linux/nsproxy.h>
+#include <linux/sysctl.h>
+
+#ifdef CONFIG_IPC_NS
+static void *get_ipc(ctl_table *table)
+{
+	char *which = table->data;
+	struct ipc_namespace *ipc_ns = current->nsproxy->ipc_ns;
+	which = (which - (char *)&init_ipc_ns) + (char *)ipc_ns;
+	return which;
+}
+#else
+#define get_ipc(T) ((T)->data)
+#endif
+
+#ifdef CONFIG_PROC_FS
+static int proc_ipc_dointvec(ctl_table *table, int write, struct file *filp,
+	void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table ipc_table;
+	memcpy(&ipc_table, table, sizeof(ipc_table));
+	ipc_table.data = get_ipc(table);
+
+	return proc_dointvec(&ipc_table, write, filp, buffer, lenp, ppos);
+}
+
+static int proc_ipc_doulongvec_minmax(ctl_table *table, int write,
+	struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table ipc_table;
+	memcpy(&ipc_table, table, sizeof(ipc_table));
+	ipc_table.data = get_ipc(table);
+
+	return proc_doulongvec_minmax(&ipc_table, write, filp, buffer,
+		lenp, ppos);
+}
+
+#else
+#define proc_ipc_doulongvec_minmax NULL
+#define proc_ipc_dointvec NULL
+#endif
+
+#ifdef CONFIG_SYSCTL_SYSCALL
+/* The generic sysctl ipc data routine. */
+static int sysctl_ipc_data(ctl_table *table, int __user *name, int nlen,
+	void __user *oldval, size_t __user *oldlenp,
+	void __user *newval, size_t newlen)
+{
+	size_t len;
+	void *data;
+
+	/* Get out of "I don't have a variable" */
+	if (!table->data || !table->maxlen)
+		return -ENOTDIR;
+
+	data = get_ipc(table);
+	if (!data)
+		return -ENOTDIR;
+
+	if (oldval && oldlenp) {
+		if (get_user(len, oldlenp))
+			return -EFAULT;
+		if (len) {
+			if (len > table->maxlen)
+				len = table->maxlen;
+			if (copy_to_user(oldval, data, len))
+				return -EFAULT;
+			if (put_user(len, oldlenp))
+				return -EFAULT;
+		}
+	}
+
+	if (newval && newlen) {
+		if (newlen > table->maxlen)
+			newlen = table->maxlen;
+
+		if (copy_from_user(data, newval, newlen))
+			return -EFAULT;
+	}
+	return 1;
+}
+#else
+#define sysctl_ipc_data NULL
+#endif
+
+static struct ctl_table ipc_kern_table[] = {
+	{
+		.ctl_name	= KERN_SHMMAX,
+		.procname	= "shmmax",
+		.data		=
Re: [PATCH 0/59] Cleanup sysctl
Eric, really good job! Patches: 1-13, 15-24, 26-32, 34-44, 46-49, 52-55, 57 (all except below) Acked-By: Kirill Korotaev [EMAIL PROTECTED]
14/59 - minor (extra space)
25/59 - minor note
33/59 - not sorted sysctl IDs
45/59 - typo
50/59 - copyright/file note
51/59 - copyright/file name/kconfig option notes
56,58,59/59 - will review tomorrow

Another issue I have to think over is the removal of de->owner. Alexey Dobriyan has recently sent patches fixing /proc <-> module refcounting. I guess without those patches your changes are not safe if proc_handler or strategy are functions from a module. Thanks, Kirill

There has not been much maintenance on sysctl in years, and as a result there is a lot to do to allow future interesting work to happen; being ambitious, I'm trying to do it all at once :) The patches in this series fall into several general categories.
- Removal of useless attempts to override the standard sysctls
- Registration of sysctl numbers in sysctl.h so someone else does not use the same magic number and conflict.
- C99 conversions so it becomes possible to change the layout of struct ctl_table without breaking everything.
- Removal of useless claims of module ownership in the proc dir entries
- Removal of sys_sysctl support where people had used conflicting sysctl numbers. Trying to break glibc or other applications by changing the ABI is not cool. 9 instances of this in the kernel seems a little extreme.
- General enhancements once I got the junk I could see out.
Odds are I missed something; most of the cleanups are simply a result of me working on the sysctl core, glancing at the users, and going: What? Eric
Re: [Devel] Re: Network virtualization/isolation
If there is a better and less intrusive while still being obvious method I am all for it. I do not like the OpenVZ thing of doing the lookup once, stashing the value in current, and then special-casing the exceptions. Why? I like it when things are obvious and not implied. The implementation seems to favor fewer lines of code touched over maintainability of the code. Which, if you are maintaining out-of-tree code, is fine. At least that was my impression last time I looked at the code.

FYI, when we started doing networking virtualization many years ago we tried both approaches. Over time, the context notion looked much more natural and easier for us. Even Alexey Kuznetsov says that he prefers exec_env, as the logic becomes very clear and little mess is introduced.

I know there are a lot of silly things in the existing implementations because they were initially written without the expectation of being able to merge the code into the main kernel. This resulted in some non-general interfaces, and a preference for patches that touch as few lines of code as possible.

Sure, but the OpenVZ code is constantly being cleaned of such things and we are open for discussion. No one pretends the code is perfect from the beginning. Anyway, this bit has been discussed before and we can discuss it seriously in the context of patch review. Let me explain why an explicit context like exec_env is IMHO cleaner:
- context is a natural notion in the Linux kernel, e.g. current. Why not pass 'current' to all the functions as an argument, starting from entry.S? in_atomic(), in_interrupt() etc. -- all these functions deal with the current context. IMHO, when one needs to pass an argument as many times as 'current', it is better to use the notion of a context.
- e.g. NFS should set the networking context of the mount point or socket.
But, ok, it is not the real point to argue so much about IMHO, and waste our time instead of doing things.
Thanks, Kirill
Re: [Devel] Re: [RFC] network namespaces
Herbert Poetzl wrote: my point (until we have an implementation which clearly shows that performance is equal/better to isolation) is simply this: of course, you can 'simulate' or 'construct' all the isolation scenarios with kernel bridging and routing and tricky injection/marking of packets, but this usually comes with an overhead ...

Well, TANSTAAFL*, and pretty much everything comes with an overhead. Multitasking comes with the (scheduler, context switch, CPU cache, etc.) overhead -- is that a reason to abandon it? OpenVZ and Linux-VServer resource management also adds some overhead -- do we want to throw it away? The question is not just equal or better performance; the question is what we get and how much we pay for it.

Equal or better performance is certainly required when we have the code compiled in but aren't using it. We must not penalize the current code.

You talk about host system performance. Neither approach introduces overhead to host networking.

Finally, as I understand it, both network isolation and network virtualization (both level 2 and level 3) can happily co-exist. We have several filesystems in the kernel; let's have several network virtualization approaches, and let the user choose. Does that make sense?

If there are no compelling arguments for using both, it is silly to merge both, as it is more maintenance overhead. That said, I think there is a real chance if we can look at the bind filtering and find a way to express it in the networking stack through iptables. Using the security hooks conflicts with things like SELinux, although it would be interesting to see if SELinux can already implement general-purpose layer 3 filtering. The more I look, the more my gut feeling is that the way to proceed would be to add a new table that filters binds and connects, plus a new module that would look at a process creating a socket and tell us if it is in the appropriate group of processes.
With a little care that would be a general solution to the layer 3 filtering problem.

Huh, you will still have to insert lots of access checks into different parts of the code: RAW sockets, netlink, protocols where checks are not yet inserted, netfilter (to disallow creating iptables rules :)) and many, many other places. I can see Dave Miller looking at such a patch, and my ears already hear his rude words :) Thanks, Kirill
Re: [RFC] network namespaces
On Tue, Sep 05, 2006 at 08:45:39AM -0600, Eric W. Biederman wrote: Daniel Lezcano [EMAIL PROTECTED] writes: For HPC, if you are interested in migration you need a separate IP per container. If you can take your IP address with you, migration of networking state is simple. If you can't take your IP address with you, a network container is nearly pointless from a migration perspective. Beyond that, from everything I have seen, layer 2 is just much cleaner than any layer 3 approach short of Serge's bind filtering.

well, the 'ip subset' approach Linux-VServer and other jail solutions use is very clean, it just does not match your expectations of a virtual interface (as there is none) and it does not cope well with all kinds of per-context 'requirements', which IMHO do not really exist on the application layer (only on the whole-system layer)

I probably expressed that wrong. There are currently three basic approaches under discussion. Layer 3 (basically bind filtering): nothing at the packet level. The approach taken by Serge's version of bsdjails and VServer. Layer 2.5: what Daniel proposed. Layer 2 (trivially mapping each packet to a different interface): treating everything as multiple instances of the network stack. Roughly what OpenVZ and I have implemented.

I think classifying network virtualization by Layer X is not good enough. OpenVZ has layer 3 (venet) and layer 2 (veth) implementations, but in both cases the networking stack inside a VE remains fully virtualized. Thanks, Kirill
Re: [RFC] network namespaces
Yes, performance is probably one issue. My concern was about layer 2 / layer 3 virtualization. I agree a layer 2 isolation/virtualization is the best for a system container. But there is another family of containers, called application containers: it is not a whole system which is run inside a container but only the application. If you want to run an Oracle database inside a container, you can run it inside an application container without launching init and all the services.

This family of containers is also used for HPC (high performance computing) and for distributed checkpoint/restart. The cluster runs hundreds of jobs, spawning them on different hosts inside an application container. Usually the jobs communicate with broadcast and multicast.

Application containers do not care about having different MAC addresses and rely on a layer 3 approach. Are application containers comfortable with a layer 2 virtualization? I don't think so, because several jobs running inside the same host communicate via broadcast/multicast between themselves and with other jobs running on different hosts. The IP consumption is a problem too: 1 container == 2 IPs (one for the root namespace, one for the container), multiplied by the number of jobs. Furthermore, lots of jobs == lots of virtual devices.

However, after a discussion with Kirill at OLS, it appears we can merge the layer 2 and layer 3 approaches if the level of network virtualization is tunable and we can choose layer 2 or layer 3 when doing the unshare. The determination of the namespace for incoming traffic can be done with a specific iptables module as a first step.

While looking at the network namespace patches, it appears that the TCP/UDP part is **very** similar to what is needed for a layer 3 approach. Any thoughts?

My humble opinion is that your approach doesn't intersect with this one, so we can freely go with both *if needed*, and hear the comments from the network guru guys on what to improve and how.
So I suggest you at least send the patches, so we can discuss them. Thanks, Kirill
[PATCH] neigh_table_clear() doesn't free stats
neigh_table_clear() doesn't free tbl->stats. Found by Alexey Kuznetsov. Though Alexey considers this leak minor for mainstream, I still believe that cleanup code should not forget to free some of the resources :) At least, this is critical for OpenVZ with virtualized neighbour tables.

Signed-Off-By: Kirill Korotaev [EMAIL PROTECTED]

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 89b7904..a45bd21 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1429,6 +1429,9 @@ int neigh_table_clear(struct neigh_table
 	kfree(tbl->phash_buckets);
 	tbl->phash_buckets = NULL;
 
+	free_percpu(tbl->stats);
+	tbl->stats = NULL;
+
 	return 0;
 }
Re: [PATCH 3/9] network namespaces: playing and debugging
Temporary code to play with network namespaces in the simplest way. Do

exec 7< /proc/net/net_ns

in your bash shell and you'll get a brand new network namespace. There you can, for example, do

ip link set lo up
ip addr list
ip addr add 1.2.3.4 dev lo
ping -n 1.2.3.4

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]

NACK, new /proc interfaces are not acceptable.

As you can see from the comment, this patch is just for playing with network namespaces. Kirill
Re: [RFC] network namespaces
Basically there are currently 3 approaches that have been proposed.

The trivial bsdjail style as implemented by Serge, and in a slightly more sophisticated version in VServer. As this approach does not touch the packets, it has little to no packet-level overhead. Basically this is what I have called the Layer 3 approach.

The more in-depth approach where we modify the packet processing based upon which network interface the packet comes in on, so that each namespace has its own instance of the network stack. Roughly what was proposed earlier in this thread as the Layer 2 approach. This potentially has per-packet overhead, so we need to watch the implementation very carefully.

Some weird hybrid as proposed by Daniel, whose semantics I was never clear on.

The good thing is that these approaches do not contradict each other. We discussed it with Daniel during the summit, and as Andrey proposed, some shortcuts can be created to avoid double stack traversal.

From the previous conversations my impression was that as long as we could get a Layer 2 approach that did not slow down the networking stack and was clean, everyone would be happy.

agree.

I'm buried in the process id namespace at the moment, and expect to be so for the rest of the month, so I'm not going to be very helpful except for a few stray comments.

I will be very much obliged if you find some time to review these new patches so that we can make some progress here. Thanks, Kirill
Re: [PATCH] limit rt cache size
1) dynamic table growth is the only reasonable way to handle this and not waste memory in all cases

Definitely, that's the ideal way to go. But there's a lot of state to update (more or less atomically, too) in the TCP hashes. It looks tricky to do that without hurting performance, especially since you'll probably want to resize the tables exactly when you've discovered they're full and busy, and the memory is fragmented too! :/ Kirill
Re: [PATCH] limit rt cache size
2) for cases where we haven't implemented dynamic table growth, specifying a proper limit argument to the hash table allocation is a sufficient solution for the time being

Agreed, it's just that we don't know what the proper limits are. I guess it would need someone running quite a lot of benchmarks. Anyone volunteering? @) In my original post I noted how easy it is to consume the whole 1GB of RAM on an i686 PC (and that's only 4,194,304 entries); it looks like with more IP addresses it is not that hard to consume much more memory. Or do we pick some cheesy default, document the options to change it clearly, and wait for feedback from users on what works for them? Kirill
[PATCH] add_timer - mod_timer() in dst_run_gc()
Patch from Dmitry Mishin [EMAIL PROTECTED]: Replace add_timer() with mod_timer() in dst_run_gc() in order to avoid a BUG message:

CPU1                                 CPU2
dst_run_gc() entered                 dst_run_gc() entered
spin_lock(&dst_lock)                 .....
del_timer(&dst_gc_timer)             fail to get lock
mod_timer() <--- puts timer back to the list
                                     add_timer(&dst_gc_timer) <--- BUG because timer is in list already.

Found during OpenVZ internal testing. At first we thought that it is OpenVZ-specific, as we added a dst_run_gc(0) call in dst_dev_event(), but as Alexey pointed out to me it is possible to trigger this condition in the mainstream kernel as well. F.e. the timer has fired on CPU2, but the handler was preempted by an irq before dst_lock is tried. Meanwhile, someone on CPU1 adds an entry to the gc list and starts the timer. If CPU2 was preempted long enough, this timer can expire simultaneously with resuming the timer handler on CPU1, arriving exactly at the situation described.

Signed-Off-By: Dmitry Mishin [EMAIL PROTECTED]
Signed-Off-By: Kirill Korotaev [EMAIL PROTECTED]
Signed-Off-By: Alexey Kuznetsov [EMAIL PROTECTED]

--- ./net/core/dst.c.dst	2006-05-19 13:12:34.000000000 +0400
+++ ./net/core/dst.c	2006-05-22 14:29:50.000000000 +0400
@@ -95,12 +95,11 @@ static void dst_run_gc(unsigned long dum
 		dst_gc_timer_inc = DST_GC_INC;
 		dst_gc_timer_expires = DST_GC_MIN;
 	}
-	dst_gc_timer.expires = jiffies + dst_gc_timer_expires;
 #if RT_CACHE_DEBUG >= 2
 	printk("dst_total: %d/%d %ld\n",
 	       atomic_read(&dst_total), delayed,  dst_gc_timer_expires);
 #endif
-	add_timer(&dst_gc_timer);
+	mod_timer(&dst_gc_timer, jiffies + dst_gc_timer_expires);
 out:
 	spin_unlock(&dst_lock);
Re: [PATCH] limit rt cache size
David Miller wrote: we quickly discover this GIT commit: 424c4b70cc4ff3930ee36a2ef7b204e4d704fd26 [IPV4]: Use the fancy alloc_large_system_hash() function for route hash table - rt hash table allocated using alloc_large_system_hash() function. Signed-off-by: Eric Dumazet [EMAIL PROTECTED] Signed-off-by: David S. Miller [EMAIL PROTECTED] And it is clear that the old code used num_physpages, which counts low memory only. This shows clearly that Eric's usage of the HASH_HIGHMEM flag here is erroneous. So we should remove it.

At least on i686, num_physpages includes highmem, so IMHO this bug has been there for years:

./arch/i386/mm/init.c:
static void __init set_max_mapnr_init(void)
{
#ifdef CONFIG_HIGHMEM
	num_physpages = highend_pfn;
#else
	num_physpages = max_low_pfn;
#endif
}

Look! This thing even uses num_physpages in the current code to compute the scale argument to alloc_large_system_hash() :)))

The same bug here? :) The good thing is that it only selects a scale of 15 or 17 -- no other choice is possible here :)))

As for the routing cache size, it looks like another bug. route.c should not force rt_max_size = 16*rt_hash_size. I think it should consult available memory and limit rt_max_size to some reasonable value, even if the hash size is too high.

Sure. This current setting of 16*rt_hash_size is meant to try to limit hash chain lengths, I guess. 2.4.x does the same thing. Note also that by basing it upon the number of routing cache hash chains, it is effectively consulting available memory. This is why, when the hash table sizing is crap, so is the rt_max_size calculation.
Fix one and you fix them both :)

IMHO the chain length limitation to 16 is not that bad, but such 'features' should probably be fixed anyway :)

Once the HASH_HIGHMEM flag is removed, assuming the system has >= 128K of memory, what we get is:

hash_chains = lowmem / 128K
rt_max_size = ((lowmem / 128K) * 16) == lowmem / 8K

So we allow one routing cache entry for each 8K of lowmem we have :) So for now it is probably sufficient to just get rid of the HASH_HIGHMEM flag here. Later we can try changing this multiplier of 16 to something like 8 or even 4.

Should we remove it for the TCP hashes? Thanks, Kirill
Re: [PATCH] limit rt cache size
David Miller wrote: From: Andi Kleen [EMAIL PROTECTED] Date: Tue, 8 Aug 2006 08:53:03 +0200 That's still too big. Consider a 2TB machine with all memory in LOWMEM. Andi, I agree with you, route.c should pass in a suitable limit. I'm just suggesting a fix for a separate problem.

So, summing up, we have the following issues IMHO:
1a) rt hash size should be calculated based on lowmem size, not total size
1b) rt hash size should have some upper limit (requested by Andi Kleen for huge-memory systems)
2a) max number of rt hash entries should be calculated based on low memory, not as rt_hash_chains*16.
2b) when CONFIG_DEBUG_SLAB and CONFIG_DEBUG_PAGE_ALLOC are ON, 2a) should be corrected taking into account the _real_ rtable entry size (one page instead of 256b!!!).
3) should we limit the TCP ehash and bhash sizes the same way?

If I haven't missed something, I will prepare a patch for 1-2) and a separate patch for 3). Thanks, Kirill
Re: [patch 3/4] Network namespaces: IPv4 FIB/routing in namespaces
Structures related to IPv4 routing (FIB and routing cache) are made per-namespace.

Hi Andrey, if the resources are private to the namespace, how will you handle NFS mounted before creating the network namespace? Do you take care of that, or simply assume you can't access NFS anymore?

This is a question that brings up another level of interaction between networking and the rest of the kernel code. The solution that I use now makes the NFS communication part always run in the root namespace. This is discussable, of course, but it's a far more complicated matter than just device lists or routing :)

If we had containers (not namespaces) then it would also be possible to run NFS in the context of the appropriate container, and thus each user could mount NFS itself with the correct networking context. It's another thing which ties subsystems together and makes namespaces ugly :/ Kirill
Re: [RFC][patch 1/4] Network namespaces: cleanup of dev_base list use
Cleanup of dev_base list use, with the aim to make the device list per-namespace. In almost every occasion, use of the dev_base variable and the dev->next pointer could be easily replaced by a for_each_netdev loop. A few of the most complicated places were converted to using first_netdev()/next_netdev().

As a proof-of-concept patch this is ok. As a real-world patch this is much too big, which prevents review. Plus it takes a few actions that are more than just replacing iterators over the device list.

Mmm, actually it is a whole changeset and should go as one patch. I didn't find it to be big, and my review took only 5-10 mins. I also don't think that mailing each driver maintainer is a good idea -- only if we want to make some buzz :) Kirill
Re: [RFC][patch 1/4] Network namespaces: cleanup of dev_base list use
Cleanup of dev_base list use, with the aim to make the device list per-namespace. In almost every occasion, use of the dev_base variable and the dev->next pointer could be easily replaced by a for_each_netdev loop. A few of the most complicated places were converted to using first_netdev()/next_netdev().

As a proof-of-concept patch this is ok. As a real-world patch this is much too big, which prevents review. Plus it takes a few actions that are more than just replacing iterators over the device list.

Mmm, actually it is a whole changeset and should go as one patch. I didn't find it to be big, and my review took only 5-10 mins. I also don't think that mailing each driver maintainer is a good idea -- only if we want to make some buzz :)

Thanks for supporting my case. You reviewed it and missed the obvious typo. I do agree that a patchset doing it all should happen at once.

This doesn't support anything. E.g. I caught quite a lot of bugs after Ingo Molnar, but this doesn't make his code poor. People are people. Anyway, I would be happy to see the typo.

As for not mailing the maintainers of the code we are changing: that would just be irresponsible. Kirill