Re: [PATCH 0/12] L2 network namespace (v3)
On Friday 19 January 2007 10:27, Eric W. Biederman wrote:
> YOSHIFUJI Hideaki / 吉藤英明 [EMAIL PROTECTED] writes:
>
> > In article [EMAIL PROTECTED] (at Wed, 17 Jan 2007 18:51:14 +0300), Dmitry Mishin [EMAIL PROTECTED] says:
> >
> > > ===
> > > L2 network namespaces
> > >
> > > The most straightforward concept of network virtualization is complete separation of namespaces, covering device list, routing tables, netfilter tables, socket hashes, and everything else.
> > >
> > > On input path, each packet is tagged with namespace right from the place where it appears from a device, and is processed by each layer in the context of this namespace. Non-root namespaces communicate with the outside world in two ways: by owning hardware devices, or receiving packets forwarded to them by their parent namespace via pass-through device.
> >
> > Can you handle multicast / broadcast and IPv6, which are very important?
>
> The basic idea here is very simple.
>
> Each network namespace appears to user space as a separate network stack, with its own set of routing tables etc.
>
> All sockets and all network devices (the sources of packets) belong to exactly one network namespace.
>
> From the socket or the network device a packet enters the network stack you can infer the network namespace that it will be processed in. Each network namespace should get its own complement of the data structures necessary to process packets, and everything should work.
>
> Talking between namespaces is accomplished either through an external network, or through a special pseudo network device. The simplest to implement is two network devices where all packets transmitted on one are received on the other. Then by placing one network device in one namespace and the other in another namespace it looks like two machines connected by a cross over cable.
>
> Once you have that in one namespace you can connect other namespaces with the existing ethernet bridging or by configuring one of the namespaces as a router and routing traffic between them.
>
> Supporting IPv6 is roughly as difficult as supporting IPv4.
>
> What needs to happen to convert code is that all variables either need a per network namespace instance or the data structures need to be modified to have a network namespace tag. For hash tables, which are hard to allocate dynamically, tagging is the preferred conversion method; for anything that is small enough duplication is preferred, as it allows the existing logic to be kept.
>
> In the fast path the impact of all of the conversions should be very light, to non-existent. In network stack initialization and cleanup there is work to do because you are initializing and cleaning up variables more often than at module insertion and removal.
>
> So my expectation is that once we get a framework established and merged to allow network namespaces, eventually the entire network stack will be converted. Not just ipv4 and ipv6 but decnet, ipx, iptables, fair scheduling, ethernet bridging and all of the other weird and twisty bits of the linux network stack.

Thanks Eric for such a descriptive comment. I can only sign off on it :)

> The primary practical hurdle is that there is a lot of networking code in the kernel.
>
> I think I know a path by which we can incrementally merge support for network namespaces without breaking anything. More to come on this when I finish up my demonstration patchset in a week or so that is complete enough to show what I am talking about.
>
> I hope this helps put the concept into perspective.

I'll be waiting for it.

> As for Dmitry's patchset in particular, it currently does not support IPv6 and I don't know where it is with respect to broadcast and multicast, but I don't see any immediate problems that would preclude those from working. But any incompleteness is exactly that: incompleteness and an implementation problem, not a fundamental design issue.

Broadcasts/multicasts are supported.

--
Thanks,
Dmitry.
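The two conversion strategies described in the message above (duplicating small state per namespace, and tagging entries of a shared hash table) can be illustrated with a small userspace sketch. This is not kernel code and not taken from any of the patchsets: `struct net_ns`, `struct sock_entry`, `sock_insert()`, `sock_lookup()` and `HASH_SIZE` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define HASH_SIZE 16	/* hypothetical table size */

/* Duplication: small state simply lives once per namespace. */
struct net_ns {
	int default_ttl;	/* example of a small per-namespace variable */
};

/* Tagging: one shared hash table, every entry carries its namespace. */
struct sock_entry {
	struct net_ns *ns;	/* namespace tag */
	unsigned short port;
	struct sock_entry *next;
};

static struct sock_entry *sock_hash[HASH_SIZE];

static void sock_insert(struct sock_entry *e)
{
	unsigned int h = e->port % HASH_SIZE;

	assert(e->ns != NULL);	/* every entry belongs to exactly one namespace */
	e->next = sock_hash[h];
	sock_hash[h] = e;
}

/* Lookup matches on the key and on the namespace tag. */
static struct sock_entry *sock_lookup(struct net_ns *ns, unsigned short port)
{
	struct sock_entry *e;

	for (e = sock_hash[port % HASH_SIZE]; e != NULL; e = e->next)
		if (e->port == port && e->ns == ns)
			return e;
	return NULL;
}
```

With this layout two namespaces can both own an entry for port 80 in the same table, and each lookup only ever returns the entry tagged with the caller's namespace. The extra cost on the lookup path is one pointer comparison per candidate entry.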
- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/12] L2 network namespace (v3): current network namespace operations
On Wednesday 17 January 2007 23:16, Eric W. Biederman wrote:
> Dmitry Mishin [EMAIL PROTECTED] writes:
>
> > Added functions and macros required to operate with network namespaces. They are required in order to switch network namespace for incoming packets and to not extend current network interface by additional network namespace argument.
>
> Is exec_net only used in interrupt context?

I tried to do so.

> Or how do you ensure a sleeping function does not get called and the kernel process comes back on another cpu?

Seems that I forgot to remove its usage in at least one place - in clone_net_ns(). If you caught more, please let me know.

--
Thanks,
Dmitry.
[PATCH 0/12] L2 network namespace (v3)
This is an update of the L2 network namespaces patches. They are applicable to Cedric's 2.6.20-rc4-mm1-lxc2 tree.

Changes:
 - updated to 2.6.20-rc4-mm1-lxc2
 - current network context is per-CPU now
 - fixed compilation without CONFIG_NET_NS

The changed definition of the current context should fix all issues mentioned by Cedric:
 - the nsproxy backpointer is unnecessary now - thus removed;
 - the push_net_ns() and pop_net_ns() use a per-CPU variable now;
 - there is no race on ->nsproxy between push_net_ns() and exit_task_namespaces() because they deal with different pointers.

===
L2 network namespaces

The most straightforward concept of network virtualization is complete separation of namespaces, covering device list, routing tables, netfilter tables, socket hashes, and everything else.

On input path, each packet is tagged with namespace right from the place where it appears from a device, and is processed by each layer in the context of this namespace. Non-root namespaces communicate with the outside world in two ways: by owning hardware devices, or receiving packets forwarded to them by their parent namespace via pass-through device.

This complete separation of namespaces is very useful for at least two purposes:
 - allowing users to create and manage by their own various tunnels and VPNs, and
 - enabling easier and more straightforward live migration of groups of processes with their environment.

--
Thanks,
Dmitry.
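The push/pop discipline for the current network context mentioned above can be sketched in userspace as follows. This is only an illustration under stated assumptions: a thread-local variable stands in for the per-CPU variable, and `current_ns`, `push_ns()`, `pop_ns()` and `deliver()` are hypothetical names, not the patchset's implementation.

```c
#include <assert.h>
#include <stddef.h>

struct net_ns {
	int id;
};

/* Stand-in for the per-CPU "current network context" (assumption:
 * one execution context per thread, so thread-local storage is enough). */
static _Thread_local struct net_ns *current_ns;

/* Switch to ns and return the previous context so it can be restored. */
static struct net_ns *push_ns(struct net_ns *ns)
{
	struct net_ns *old = current_ns;

	assert(ns != NULL);
	current_ns = ns;
	return old;
}

/* Restore a context previously returned by push_ns(). */
static void pop_ns(struct net_ns *old)
{
	current_ns = old;
}

/* Process one packet in the namespace of the device it arrived on. */
static int deliver(struct net_ns *dev_ns)
{
	struct net_ns *old = push_ns(dev_ns);
	int seen = current_ns->id;	/* lower layers see dev_ns as current */

	pop_ns(old);
	return seen;
}
```

The point of returning the old value from `push_ns()` is that pushes nest: each caller restores exactly what it found. The sketch also hints at why code between a push and its pop must not migrate to another execution context, since the saved value only makes sense where it was pushed.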
[PATCH 7/12] allow proc_dir_entries to have destructor
Destructor field added to proc_dir_entries, standard destructor kfree'ing data introduced.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 fs/proc/generic.c       |   10 ++++++++--
 fs/proc/root.c          |    1 +
 include/linux/proc_fs.h |    4 ++++
 3 files changed, 13 insertions(+), 2 deletions(-)

Index: 2.6.20-rc4-mm1/fs/proc/generic.c
===================================================================
--- 2.6.20-rc4-mm1.orig/fs/proc/generic.c
+++ 2.6.20-rc4-mm1/fs/proc/generic.c
@@ -611,6 +611,11 @@ static struct proc_dir_entry *proc_creat
 	return ent;
 }
 
+void proc_data_destructor(struct proc_dir_entry *ent)
+{
+	kfree(ent->data);
+}
+
 struct proc_dir_entry *proc_symlink(const char *name,
 		struct proc_dir_entry *parent, const char *dest)
 {
@@ -623,6 +628,7 @@ struct proc_dir_entry *proc_symlink(cons
 		ent->data = kmalloc((ent->size=strlen(dest))+1, GFP_KERNEL);
 		if (ent->data) {
 			strcpy((char*)ent->data,dest);
+			ent->destructor = proc_data_destructor;
 			if (proc_register(parent, ent) < 0) {
 				kfree(ent->data);
 				kfree(ent);
@@ -701,8 +707,8 @@ void free_proc_entry(struct proc_dir_ent
 
 	release_inode_number(ino);
 
-	if (S_ISLNK(de->mode) && de->data)
-		kfree(de->data);
+	if (de->destructor)
+		de->destructor(de);
 	kfree(de);
 }
 
Index: 2.6.20-rc4-mm1/fs/proc/root.c
===================================================================
--- 2.6.20-rc4-mm1.orig/fs/proc/root.c
+++ 2.6.20-rc4-mm1/fs/proc/root.c
@@ -167,6 +167,7 @@ EXPORT_SYMBOL(proc_symlink);
 EXPORT_SYMBOL(proc_mkdir);
 EXPORT_SYMBOL(create_proc_entry);
 EXPORT_SYMBOL(remove_proc_entry);
+EXPORT_SYMBOL(proc_data_destructor);
 EXPORT_SYMBOL(proc_root);
 EXPORT_SYMBOL(proc_root_fs);
 EXPORT_SYMBOL(proc_net);
Index: 2.6.20-rc4-mm1/include/linux/proc_fs.h
===================================================================
--- 2.6.20-rc4-mm1.orig/include/linux/proc_fs.h
+++ 2.6.20-rc4-mm1/include/linux/proc_fs.h
@@ -45,6 +45,8 @@ typedef int (read_proc_t)(char *page, ch
 typedef	int (write_proc_t)(struct file *file, const char __user *buffer,
 			   unsigned long count, void *data);
 typedef int (get_info_t)(char *, char **, off_t, int);
+struct proc_dir_entry;
+typedef void (destroy_proc_t)(struct proc_dir_entry *);
 
 struct proc_dir_entry {
 	unsigned int low_ino;
@@ -64,6 +66,7 @@ struct proc_dir_entry {
 	read_proc_t *read_proc;
 	write_proc_t *write_proc;
 	atomic_t count;		/* use count */
+	destroy_proc_t *destructor;
 	int deleted;		/* delete flag */
 	void *set;
 };
@@ -108,6 +111,7 @@ char *task_mem(struct mm_struct *, char
 extern struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode,
 						struct proc_dir_entry *parent);
 extern void remove_proc_entry(const char *name, struct proc_dir_entry *parent);
+extern void proc_data_destructor(struct proc_dir_entry *);
 
 extern struct vfsmount *proc_mnt;
 extern int proc_fill_super(struct super_block *,void *,int);
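The idea of this patch, replacing a type-specific check in the free path with an optional destructor callback, can be shown with a small userspace sketch. It uses malloc/free instead of the kernel allocator, and `struct entry`, `data_destructor()`, `free_entry()` and `make_symlink_entry()` are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct entry;
typedef void (destroy_t)(struct entry *);

struct entry {
	void *data;
	destroy_t *destructor;	/* optional, may be NULL */
};

static int destroyed;	/* counts destructor calls, for demonstration only */

/* Standard destructor: the entry owns a heap-allocated data pointer. */
static void data_destructor(struct entry *e)
{
	free(e->data);
	e->data = NULL;
	destroyed++;
}

/* Generic free path: no per-type special cases, just the callback. */
static void free_entry(struct entry *e)
{
	assert(e != NULL);
	if (e->destructor)
		e->destructor(e);
	free(e);
}

/* An entry that owns a copy of a string, like a symlink target. */
static struct entry *make_symlink_entry(const char *dest)
{
	struct entry *e = calloc(1, sizeof(*e));

	if (!e)
		return NULL;
	e->data = malloc(strlen(dest) + 1);
	if (!e->data) {
		free(e);
		return NULL;
	}
	strcpy(e->data, dest);
	e->destructor = data_destructor;
	return e;
}
```

Any creator that attaches owned data sets the destructor itself, so the generic free path never needs to know which kinds of entries own their data.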
[PATCH 8/12] net_device seq_file
Library function to create a seq_file in proc filesystem, showing some information for each netdevice. This code is present in the kernel in about 10 instances, and all of them can be converted to using introduced library function.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 include/linux/netdevice.h |    7 +++
 net/core/dev.c            |   96 ++
 2 files changed, 103 insertions(+)

--- linux-2.6.20-rc4-mm1.net_ns.orig/include/linux/netdevice.h
+++ linux-2.6.20-rc4-mm1.net_ns/include/linux/netdevice.h
@@ -604,6 +604,13 @@ extern int		register_netdevice(struct ne
 extern int		unregister_netdevice(struct net_device *dev);
 extern void		free_netdev(struct net_device *dev);
 extern void		synchronize_net(void);
+#ifdef CONFIG_PROC_FS
+extern int		netdev_proc_create(char *name,
+				int (*show)(struct seq_file *,
+					struct net_device *, void *),
+				void *data, struct module *mod);
+void			netdev_proc_remove(char *name);
+#endif
 extern int		register_netdevice_notifier(struct notifier_block *nb);
 extern int		unregister_netdevice_notifier(struct notifier_block *nb);
 extern int		call_netdevice_notifiers(unsigned long val, void *v);
--- linux-2.6.20-rc4-mm1.net_ns.orig/net/core/dev.c
+++ linux-2.6.20-rc4-mm1.net_ns/net/core/dev.c
@@ -2099,6 +2099,102 @@ static int dev_ifconf(char __user *arg)
 }
 
 #ifdef CONFIG_PROC_FS
+
+struct netdev_proc_data {
+	struct file_operations fops;
+	int (*show)(struct seq_file *, struct net_device *, void *);
+	void *data;
+};
+
+static void *netdev_proc_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct net_device *dev;
+	loff_t off;
+
+	read_lock(&dev_base_lock);
+	if (*pos == 0)
+		return SEQ_START_TOKEN;
+	for (dev = dev_base, off = 1; dev; dev = dev->next, off++) {
+		if (*pos == off)
+			return dev;
+	}
+	return NULL;
+}
+
+static void *netdev_proc_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return (v == SEQ_START_TOKEN) ? dev_base
+			: ((struct net_device *)v)->next;
+}
+
+static void netdev_proc_seq_stop(struct seq_file *seq, void *v)
+{
+	read_unlock(&dev_base_lock);
+}
+
+static int netdev_proc_seq_show(struct seq_file *seq, void *v)
+{
+	struct netdev_proc_data *p;
+
+	p = seq->private;
+	return (*p->show)(seq, v, p->data);
+}
+
+static struct seq_operations netdev_proc_seq_ops = {
+	.start = netdev_proc_seq_start,
+	.next = netdev_proc_seq_next,
+	.stop = netdev_proc_seq_stop,
+	.show = netdev_proc_seq_show,
+};
+
+static int netdev_proc_open(struct inode *inode, struct file *file)
+{
+	int err;
+	struct seq_file *p;
+
+	err = seq_open(file, &netdev_proc_seq_ops);
+	if (!err) {
+		p = file->private_data;
+		p->private = (struct netdev_proc_data *)PDE(inode)->data;
+	}
+	return err;
+}
+
+int netdev_proc_create(char *name,
+		int (*show)(struct seq_file *, struct net_device *, void *),
+		void *data, struct module *mod)
+{
+	struct netdev_proc_data *p;
+	struct proc_dir_entry *ent;
+
+	p = kzalloc(sizeof(*p), GFP_KERNEL);
+	p->fops.owner = mod;
+	p->fops.open = netdev_proc_open;
+	p->fops.read = seq_read;
+	p->fops.llseek = seq_lseek;
+	p->fops.release = seq_release;
+	p->show = show;
+	p->data = data;
+	ent = create_proc_entry(name, S_IRUGO, proc_net);
+	if (ent == NULL) {
+		kfree(p);
+		return -EINVAL;
+	}
+	ent->data = p;
+	ent->destructor = proc_data_destructor;
+	smp_wmb();
+	ent->proc_fops = &p->fops;
+	return 0;
+}
+EXPORT_SYMBOL(netdev_proc_create);
+
+void netdev_proc_remove(char *name)
+{
+	proc_net_remove(name);
+}
+EXPORT_SYMBOL(netdev_proc_remove);
+
 /*
  *	This is invoked by the /proc filesystem handler to display a device
  *	in detail.
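The start/next/show iteration protocol that this library function wraps can be sketched in plain userspace C. The sketch below only mimics the shape of the seq_file callbacks: there is no locking and no real `struct seq_file`, and `struct dev`, `START_TOKEN`, `seq_start()`, `seq_next()`, `seq_walk()` and `count_name_len()` are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>

struct dev {
	const char *name;
	struct dev *next;
};

static struct dev *dev_base;		/* head of the device list */
static char start_token;		/* its address serves as a sentinel */
#define START_TOKEN ((void *)&start_token)

/* Position 0 is the header; position n >= 1 is the n-th device. */
static void *seq_start(long *pos)
{
	struct dev *d;
	long off;

	if (*pos == 0)
		return START_TOKEN;
	for (d = dev_base, off = 1; d; d = d->next, off++)
		if (*pos == off)
			return d;
	return NULL;
}

/* Advance from the header to the first device, or to the next device. */
static void *seq_next(void *v, long *pos)
{
	++*pos;
	return (v == START_TOKEN) ? (void *)dev_base
				  : (void *)((struct dev *)v)->next;
}

/* Walk the whole sequence, calling show() once per device. */
static int seq_walk(int (*show)(struct dev *, void *), void *data)
{
	long pos = 0;
	int n = 0;
	void *v;

	assert(show != NULL);
	for (v = seq_start(&pos); v; v = seq_next(v, &pos)) {
		if (v == START_TOKEN)
			continue;	/* a real handler would print a header */
		if (show(v, data))
			break;
		n++;
	}
	return n;
}

/* Example show callback: add up the length of every device name. */
static int count_name_len(struct dev *d, void *data)
{
	const char *p;

	for (p = d->name; *p; p++)
		++*(int *)data;
	return 0;
}
```

Because `seq_start()` can resume from any position, a reader that stops in the middle (for example when its buffer is full) can pick up again later. That is why the start callback in the patch searches the list by offset instead of keeping a pointer across calls.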
[PATCH 9/12] L2 network namespace (v3): device to pass packets between namespaces
A simple device to pass packets between a namespace and its child.

Signed-off-by: Dmitry Mishin [EMAIL PROTECTED]
---
 drivers/net/Makefile     |    3
 drivers/net/veth.c       |  321 +++
 net/core/net_namespace.c |    1
 3 files changed, 325 insertions(+)

--- linux-2.6.20-rc4-mm1.net_ns.orig/drivers/net/Makefile
+++ linux-2.6.20-rc4-mm1.net_ns/drivers/net/Makefile
@@ -125,6 +125,9 @@ obj-$(CONFIG_SLIP) += slip.o
 obj-$(CONFIG_SLHC) += slhc.o
 
 obj-$(CONFIG_DUMMY) += dummy.o
+ifeq ($(CONFIG_NET_NS),y)
+obj-m += veth.o
+endif
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_DE600) += de600.o
 obj-$(CONFIG_DE620) += de620.o
--- /dev/null
+++ linux-2.6.20-rc4-mm1.net_ns/drivers/net/veth.c
@@ -0,0 +1,321 @@
+/*
+ *  Copyright (C) 2006  SWsoft
+ *
+ *  Written by Andrey Savochkin [EMAIL PROTECTED],
+ *  reusing code by Andrey Mirkin [EMAIL PROTECTED].
+ */
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/ctype.h>
+#include <asm/semaphore.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <net/dst.h>
+#include <net/xfrm.h>
+
+struct veth_struct
+{
+	struct net_device	*pair;
+	struct net_device_stats stats;
+};
+
+#define veth_from_netdev(dev) ((struct veth_struct *)(netdev_priv(dev)))
+
+/* --- *
+ *
+ * Device functions
+ *
+ * --- */
+
+static struct net_device_stats *get_stats(struct net_device *dev);
+static int veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct net_device_stats *stats;
+	struct veth_struct *entry;
+	struct net_device *rcv;
+	struct net_namespace *orig_net_ns;
+	int length;
+
+	stats = get_stats(dev);
+	entry = veth_from_netdev(dev);
+	rcv = entry->pair;
+
+	if (!(rcv->flags & IFF_UP))
+		/* Target namespace does not want to receive packets */
+		goto outf;
+
+	dst_release(skb->dst);
+	skb->dst = NULL;
+	secpath_reset(skb);
+	skb_orphan(skb);
+	nf_reset(skb);
+
+	orig_net_ns = push_net_ns(rcv->net_ns);
+	skb->dev = rcv;
+	skb->pkt_type = PACKET_HOST;
+	skb->protocol = eth_type_trans(skb, rcv);
+
+	length = skb->len;
+	stats->tx_bytes += length;
+	stats->tx_packets++;
+	stats = get_stats(rcv);
+	stats->rx_bytes += length;
+	stats->rx_packets++;
+
+	netif_rx(skb);
+	pop_net_ns(orig_net_ns);
+	return 0;
+
+outf:
+	stats->tx_dropped++;
+	kfree_skb(skb);
+	return 0;
+}
+
+static int veth_open(struct net_device *dev)
+{
+	return 0;
+}
+
+static int veth_close(struct net_device *dev)
+{
+	return 0;
+}
+
+static void veth_destructor(struct net_device *dev)
+{
+	free_netdev(dev);
+}
+
+static struct net_device_stats *get_stats(struct net_device *dev)
+{
+	return &veth_from_netdev(dev)->stats;
+}
+
+int veth_init_dev(struct net_device *dev)
+{
+	dev->hard_start_xmit = veth_xmit;
+	dev->open = veth_open;
+	dev->stop = veth_close;
+	dev->destructor = veth_destructor;
+	dev->get_stats = get_stats;
+
+	ether_setup(dev);
+
+	dev->tx_queue_len = 0;
+	return 0;
+}
+
+static void veth_setup(struct net_device *dev)
+{
+	dev->init = veth_init_dev;
+}
+
+static inline int is_veth_dev(struct net_device *dev)
+{
+	return dev->init == veth_init_dev;
+}
+
+/* --- *
+ *
+ * Management interface
+ *
+ * --- */
+
+struct net_device *veth_dev_alloc(char *name, char *addr)
+{
+	struct net_device *dev;
+
+	dev = alloc_netdev(sizeof(struct veth_struct), name, veth_setup);
+	if (dev != NULL) {
+		memcpy(dev->dev_addr, addr, ETH_ALEN);
+		dev->addr_len = ETH_ALEN;
+	}
+	return dev;
+}
+
+int veth_entry_add(char *parent_name, char *parent_addr,
+		struct net_namespace *parent_ns, char *child_name, char *child_addr,
+		struct net_namespace *child_ns)
+{
+	struct net_device *parent_dev, *child_dev;
+	int err;
+
+	err = -ENOMEM;
+	if ((parent_dev = veth_dev_alloc(parent_name, parent_addr)) == NULL)
+		goto out_alocp;
+	if ((child_dev = veth_dev_alloc(child_name, child_addr)) == NULL)
+		goto out_alocc;
+	veth_from_netdev(parent_dev)->pair = child_dev;
+	veth_from_netdev(child_dev)->pair = parent_dev;
+
+	/*
+	 * About serialization, see comments to veth_pair_del().
+	 */
+	rtnl_lock();
+	/* refcounts should be already upped, so, just put old ones */
+	put_net_ns(parent_dev->net_ns);
+	parent_dev->net_ns = parent_ns;
+	if ((err = register_netdevice
[PATCH 10/12] L2 network namespace (v3): playing with pass-through device
Temporary code to debug and play with pass-through device.
Create device pair by
	modprobe veth
	echo 'add veth1 0:1:2:3:4:1 eth0 0:1:2:3:4:2' >/proc/net/veth_ctl
and your shell will appear in a new namespace with `eth0' device.
Configure device in this namespace
	ip l s eth0 up
	ip a a 1.2.3.4/24 dev eth0
and in the root namespace
	ip l s veth1 up
	ip a a 1.2.3.1/24 dev veth1
to establish a communication channel between root namespace and the newly created one.

Code is done by Andrey Savochkin and ported by me over Cedric's patchset

Signed-off-by: Dmitry Mishin [EMAIL PROTECTED]
---
 drivers/net/veth.c       |  121 +++
 fs/proc/array.c          |    8 +++
 kernel/fork.c            |    1
 kernel/nsproxy.c         |    1
 net/core/net_namespace.c |    3 +
 5 files changed, 134 insertions(+)

--- linux-2.6.20-rc4-mm1.net_ns.orig/drivers/net/veth.c
+++ linux-2.6.20-rc4-mm1.net_ns/drivers/net/veth.c
@@ -12,6 +12,7 @@
 #include <linux/etherdevice.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <linux/syscalls.h>
 #include <net/dst.h>
 #include <net/xfrm.h>
 
@@ -245,6 +246,123 @@ void veth_entry_del_all(void)
 
 /* --- *
  *
+ * Temporary interface to create veth devices
+ *
+ * --- */
+
+#ifdef CONFIG_PROC_FS
+
+static int veth_debug_open(struct inode *inode, struct file *file)
+{
+	return 0;
+}
+
+static char *parse_addr(char *s, char *addr)
+{
+	int i, v;
+
+	for (i = 0; i < ETH_ALEN; i++) {
+		if (!isxdigit(*s))
+			return NULL;
+		*addr = 0;
+		v = isdigit(*s) ? *s - '0' : toupper(*s) - 'A' + 10;
+		s++;
+		if (isxdigit(*s)) {
+			*addr += v << 16;
+			v = isdigit(*s) ? *s - '0' : toupper(*s) - 'A' + 10;
+			s++;
+		}
+		*addr++ += v;
+		if (i < ETH_ALEN - 1 && ispunct(*s))
+			s++;
+	}
+	return s;
+}
+
+static ssize_t veth_debug_write(struct file *file, const char __user *user_buf,
+		size_t size, loff_t *ppos)
+{
+	char buf[128], *s, *parent_name, *child_name;
+	char parent_addr[ETH_ALEN], child_addr[ETH_ALEN];
+	struct net_namespace *parent_ns, *child_ns;
+	int err;
+
+	s = buf;
+	err = -EINVAL;
+	if (size >= sizeof(buf))
+		goto out;
+	err = -EFAULT;
+	if (copy_from_user(buf, user_buf, size))
+		goto out;
+	buf[size] = 0;
+
+	err = -EBADRQC;
+	if (!strncmp(buf, "add ", 4)) {
+		parent_name = buf + 4;
+		if ((s = strchr(parent_name, ' ')) == NULL)
+			goto out;
+		*s = 0;
+		if ((s = parse_addr(s + 1, parent_addr)) == NULL)
+			goto out;
+		if (!*s)
+			goto out;
+		child_name = s + 1;
+		if ((s = strchr(child_name, ' ')) == NULL)
+			goto out;
+		*s = 0;
+		if ((s = parse_addr(s + 1, child_addr)) == NULL)
+			goto out;
+
+		get_net_ns(current_net_ns);
+		parent_ns = current_net_ns;
+		if (*s == ' ') {
+			unsigned int id;
+			id = simple_strtoul(s + 1, &s, 0);
+			err = sys_bind_ns(id, NS_ALL);
+		} else
+			err = sys_unshare(CLONE_NEWNET2);
+		if (err)
+			goto out;
+		/* after bind_ns() or unshare_ns() namespace is changed */
+		get_net_ns(current_net_ns);
+		child_ns = current_net_ns;
+		err = veth_entry_add(parent_name, parent_addr, parent_ns,
+				child_name, child_addr, child_ns);
+		if (err) {
+			put_net_ns(child_ns);
+			put_net_ns(parent_ns);
+		} else
+			err = size;
+	}
+out:
+	return err;
+}
+
+static struct file_operations veth_debug_ops = {
+	.open	= veth_debug_open,
+	.write	= veth_debug_write,
+};
+
+static int veth_debug_create(void)
+{
+	proc_net_fops_create("veth_ctl", 0200, &veth_debug_ops);
+	return 0;
+}
+
+static void veth_debug_remove(void)
+{
+	proc_net_remove("veth_ctl");
+}
+
+#else
+
+static int veth_debug_create(void) { return -1; }
+static void veth_debug_remove(void) { }
+
+#endif
+
+/* --- *
+ *
  * Information in proc
  *
  * --- */
@@ -304,12 +422,15 @@ static inline void veth_proc_remove(void
 int __init veth_init
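For readers who want to experiment with the address format accepted by the control file (for example `0:1:2:3:4:1`), here is a minimal userspace sketch of a colon-separated MAC address parser. It is an independent illustration, not the patch's `parse_addr()`: the names `parse_mac()`, `hex_val()` and `MAC_LEN` are hypothetical, and unlike the patch it insists on `:` as the separator.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAC_LEN 6

/* Value of one hex digit; the caller has already checked isxdigit(). */
static int hex_val(char c)
{
	return isdigit((unsigned char)c) ? c - '0'
					 : toupper((unsigned char)c) - 'A' + 10;
}

/*
 * Parse a MAC address such as "0:1:2:3:4:1" or "00:1a:2B:3c:4d:5e".
 * Each byte is one or two hex digits; a leading digit is worth 16.
 * Returns a pointer just past the address, or NULL on malformed input.
 */
static const char *parse_mac(const char *s, unsigned char *addr)
{
	int i, v;

	assert(s != NULL && addr != NULL);
	for (i = 0; i < MAC_LEN; i++) {
		if (!isxdigit((unsigned char)*s))
			return NULL;
		v = hex_val(*s++);
		if (isxdigit((unsigned char)*s))
			v = v * 16 + hex_val(*s++);
		addr[i] = (unsigned char)v;
		if (i < MAC_LEN - 1) {
			if (*s != ':')
				return NULL;
			s++;
		}
	}
	return s;
}
```

Returning the position after the address, as the patch's parser also does, lets the caller continue tokenizing the rest of the command line from that point.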
[PATCH 11/12] L2 network namespace (v3): sockets proc view virtualization
Only current net namespace sockets or all sockets in case of init_net_ns should be visible through proc interface.

Signed-off-by: Dmitry Mishin [EMAIL PROTECTED]
---
 include/net/af_unix.h |   21 +++++++++++++++++----
 net/ipv4/tcp_ipv4.c   |    9 +++++++++
 net/ipv4/udp.c        |   13 +++++++++++--
 3 files changed, 37 insertions(+), 6 deletions(-)

--- linux-2.6.20-rc4-mm1.net_ns.orig/include/net/af_unix.h
+++ linux-2.6.20-rc4-mm1.net_ns/include/net/af_unix.h
@@ -19,9 +19,13 @@ extern atomic_t unix_tot_inflight;
 
 static inline struct sock *first_unix_socket(int *i)
 {
+	struct sock *sk;
+
 	for (*i = 0; *i <= UNIX_HASH_SIZE; (*i)++) {
-		if (!hlist_empty(&unix_socket_table[*i]))
-			return __sk_head(&unix_socket_table[*i]);
+		for (sk = sk_head(&unix_socket_table[*i]); sk; sk = sk_next(sk))
+			if (net_ns_match(sk->sk_net_ns, current_net_ns) ||
+			    net_ns_match(current_net_ns, &init_net_ns))
+				return sk;
 	}
 	return NULL;
 }
@@ -32,10 +36,19 @@ static inline struct sock *next_unix_soc
 	/* More in this chain? */
 	if (next)
 		return next;
+	for (; next != NULL; next = sk_next(next)) {
+		if (!net_ns_match(next->sk_net_ns, current_net_ns) &&
+		    !net_ns_match(current_net_ns, &init_net_ns))
+			continue;
+		return next;
+	}
 	/* Look for next non-empty chain. */
 	for ((*i)++; *i <= UNIX_HASH_SIZE; (*i)++) {
-		if (!hlist_empty(&unix_socket_table[*i]))
-			return __sk_head(&unix_socket_table[*i]);
+		for (next = sk_head(&unix_socket_table[*i]); next;
+		     next = sk_next(next))
+			if (net_ns_match(next->sk_net_ns, current_net_ns) ||
+			    net_ns_match(current_net_ns, &init_net_ns))
+				return next;
 	}
 	return NULL;
 }
--- linux-2.6.20-rc4-mm1.net_ns.orig/net/ipv4/tcp_ipv4.c
+++ linux-2.6.20-rc4-mm1.net_ns/net/ipv4/tcp_ipv4.c
@@ -1992,6 +1992,9 @@ get_req:
 	}
 get_sk:
 	sk_for_each_from(sk, node) {
+		if (!net_ns_match(sk->sk_net_ns, current_net_ns) &&
+		    !net_ns_match(current_net_ns, &init_net_ns))
+			continue;
 		if (sk->sk_family == st->family) {
 			cur = sk;
 			goto out;
@@ -2043,6 +2046,9 @@ static void *established_get_first(struc
 
 		read_lock(&tcp_hashinfo.ehash[st->bucket].lock);
 		sk_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
+			if (!net_ns_match(sk->sk_net_ns, current_net_ns) &&
+			    !net_ns_match(current_net_ns, &init_net_ns))
+				continue;
 			if (sk->sk_family != st->family) {
 				continue;
 			}
@@ -2102,6 +2108,9 @@ get_tw:
 		sk = sk_next(sk);
 
 		sk_for_each_from(sk, node) {
+			if (!net_ns_match(sk->sk_net_ns, current_net_ns) &&
+			    !net_ns_match(current_net_ns, &init_net_ns))
+				continue;
 			if (sk->sk_family == st->family)
 				goto found;
 		}
--- linux-2.6.20-rc4-mm1.net_ns.orig/net/ipv4/udp.c
+++ linux-2.6.20-rc4-mm1.net_ns/net/ipv4/udp.c
@@ -1549,6 +1549,9 @@ static struct sock *udp_get_first(struct
 	for (state->bucket = 0; state->bucket < UDP_HTABLE_SIZE; ++state->bucket) {
 		struct hlist_node *node;
 		sk_for_each(sk, node, state->hashtable + state->bucket) {
+			if (!net_ns_match(sk->sk_net_ns, current_net_ns) &&
+			    !net_ns_match(current_net_ns, &init_net_ns))
+				continue;
 			if (sk->sk_family == state->family)
 				goto found;
 		}
@@ -1565,8 +1568,14 @@ static struct sock *udp_get_next(struct
 	do {
 		sk = sk_next(sk);
 try_again:
-		;
-	} while (sk && sk->sk_family != state->family);
+		if (!sk)
+			break;
+		if (sk->sk_family != state->family)
+			continue;
+		if (net_ns_match(sk->sk_net_ns, current_net_ns) ||
+		    net_ns_match(current_net_ns, &init_net_ns))
+			break;
+	} while (1);
 
 	if (!sk && ++state->bucket < UDP_HTABLE_SIZE) {
 		sk = sk_head(state->hashtable + state->bucket);
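The visibility rule applied throughout this patch (a socket is listed if it belongs to the viewer's namespace, or if the viewer is the initial namespace) is repeated in every iterator. A userspace sketch of the same rule factored into one predicate is shown below. The names `sock_visible()`, `next_visible()`, `init_ns` and the simplified structures are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

struct net_ns {
	int id;
};

struct sock {
	struct net_ns *ns;
	int family;
	struct sock *next;
};

static struct net_ns init_ns;	/* stand-in for the root namespace */

/* A socket is listed if it belongs to the viewer's namespace, or if the
 * viewer is the root namespace, which sees everything. */
static int sock_visible(const struct sock *sk, const struct net_ns *viewer)
{
	assert(sk != NULL && viewer != NULL);
	return sk->ns == viewer || viewer == &init_ns;
}

/* Return the first visible socket of the given family at or after sk. */
static struct sock *next_visible(struct sock *sk, const struct net_ns *viewer,
				 int family)
{
	for (; sk != NULL; sk = sk->next) {
		if (!sock_visible(sk, viewer))
			continue;
		if (sk->family == family)
			return sk;
	}
	return NULL;
}
```

Keeping the rule in a single helper means the "skip" form (`!a && !b`) and the "accept" form (`a || b`) that alternate in the patch cannot drift apart.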
[PATCH 12/12] L2 network namespace (v3): L3 network namespace intro
Introduce two kinds of network namespaces - level 2 and level 3. The first one is a namespace with a full set of networking objects, while the second one is socket-level with a restricted set.

Signed-off-by: Dmitry Mishin [EMAIL PROTECTED]
---
 include/linux/net_namespace.h |    3 +++
 net/core/net_namespace.c      |   40 ++++++++++++++++++++++++++++------------
 2 files changed, 31 insertions(+), 12 deletions(-)

--- linux-2.6.20-rc4-mm1.net_ns.orig/include/linux/net_namespace.h
+++ linux-2.6.20-rc4-mm1.net_ns/include/linux/net_namespace.h
@@ -24,6 +24,9 @@ struct net_namespace {
 	int			fib4_trie_last_dflt;
 #endif
 	unsigned int		hash;
+#define NET_NS_LEVEL2		1
+#define NET_NS_LEVEL3		2
+	unsigned int		level;
 };
 
 extern struct net_namespace init_net_ns;
--- linux-2.6.20-rc4-mm1.net_ns.orig/net/core/net_namespace.c
+++ linux-2.6.20-rc4-mm1.net_ns/net/core/net_namespace.c
@@ -30,13 +30,19 @@ EXPORT_PER_CPU_SYMBOL_GPL(exec_net_ns);
 
 /*
  * Clone a new ns copying an original net ns, setting refcount to 1
+ * @level: level of namespace to create
  * @old_ns: namespace to clone
- * Return NULL on error (failure to kmalloc), new ns otherwise
+ * Return ERR_PTR on error, new ns otherwise
  */
-static struct net_namespace *clone_net_ns(struct net_namespace *old_ns)
+static struct net_namespace *clone_net_ns(unsigned int level,
+						struct net_namespace *old_ns)
 {
 	struct net_namespace *ns;
 
+	/* level 3 namespaces are incomplete in order to have childs */
+	if (current_net_ns->level == NET_NS_LEVEL3)
+		return ERR_PTR(-EPERM);
+
 	ns = kzalloc(sizeof(struct net_namespace), GFP_KERNEL);
 	if (!ns)
 		return NULL;
@@ -48,20 +54,25 @@ static struct net_namespace *clone_net_n
 	if ((push_net_ns(ns)) != old_ns)
 		BUG();
 
+	if (level == NET_NS_LEVEL2) {
 #ifdef CONFIG_IP_MULTIPLE_TABLES
-	INIT_LIST_HEAD(&ns->fib_rules_ops_list);
+		INIT_LIST_HEAD(&ns->fib_rules_ops_list);
 #endif
-	if (ip_fib_struct_init())
-		goto out_fib4;
+		if (ip_fib_struct_init())
+			goto out_fib4;
+	}
+	ns->level = level;
 	if (loopback_init())
 		goto out_loopback;
 	pop_net_ns(old_ns);
-	printk(KERN_DEBUG "NET_NS: created new netcontext %p for %s "
-		"(pid=%d)\n", ns, current->comm, current->tgid);
+	printk(KERN_DEBUG "NET_NS: created new netcontext %p, level %u, "
+		"for %s (pid=%d)\n", ns, (ns->level == NET_NS_LEVEL2) ?
+		2 : 3, current->comm, current->tgid);
 	return ns;
 
 out_loopback:
-	ip_fib_struct_cleanup(ns);
+	if (level == NET_NS_LEVEL2)
+		ip_fib_struct_cleanup(ns);
 out_fib4:
 	pop_net_ns(old_ns);
 	BUG_ON(atomic_read(&ns->kref.refcount) != 1);
@@ -75,13 +86,17 @@ out_fib4:
 int unshare_net_ns(unsigned long unshare_flags,
 		   struct net_namespace **new_net)
 {
+	unsigned int level;
+
 	if (unshare_flags & (CLONE_NEWNET2|CLONE_NEWNET3)) {
 		if (!capable(CAP_SYS_ADMIN))
 			return -EPERM;
 
-		*new_net = clone_net_ns(current->nsproxy->net_ns);
-		if (!*new_net)
-			return -ENOMEM;
+		level = (unshare_flags & CLONE_NEWNET2) ? NET_NS_LEVEL2 :
+			NET_NS_LEVEL3;
+		*new_net = clone_net_ns(level, current->nsproxy->net_ns);
+		if (IS_ERR(*new_net))
+			return PTR_ERR(*new_net);
 	}
 
 	return 0;
@@ -110,7 +125,8 @@ void free_net_ns(struct kref *kref)
 				ns, atomic_read(&ns->kref.refcount));
 		return;
 	}
-	ip_fib_struct_cleanup(ns);
+	if (ns->level == NET_NS_LEVEL2)
+		ip_fib_struct_cleanup(ns);
 	printk(KERN_DEBUG "NET_NS: net namespace %p destroyed\n", ns);
 	kfree(ns);
 }
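The switch from "return NULL on error" to the error-pointer convention can be tried out in userspace. The helpers below imitate the kernel's ERR_PTR/IS_ERR/PTR_ERR idea but are written from scratch for this sketch. `err_ptr()`, `ptr_err()`, `is_err()`, `clone_ns()` and the enum names are hypothetical, and the sketch assumes that the top 4095 addresses are never valid object pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, and decode it again. */
static void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True only for pointers in the last MAX_ERRNO addresses; NULL is not one. */
static int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

enum { NS_LEVEL2 = 1, NS_LEVEL3 = 2 };

struct net_ns {
	unsigned int level;
};

/* Create a child namespace of the given level under parent. */
static struct net_ns *clone_ns(const struct net_ns *parent, unsigned int level)
{
	struct net_ns *ns;

	assert(parent != NULL);
	if (parent->level == NS_LEVEL3)		/* level 3 cannot have children */
		return err_ptr(-EPERM);
	ns = calloc(1, sizeof(*ns));
	if (!ns)
		return err_ptr(-ENOMEM);	/* never NULL, see is_err() */
	ns->level = level;
	return ns;
}
```

With this convention a caller needs a single `is_err()` check, but only if every failure path, allocation failure included, returns an encoded error, because a plain NULL does not satisfy `is_err()`.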
Re: [Devel] Re: Network virtualization/isolation
On Saturday 09 December 2006 09:35, Herbert Poetzl wrote:
> On Fri, Dec 08, 2006 at 10:13:48PM -0800, Andrew Morton wrote:
> > On Sat, 9 Dec 2006 04:50:02 +0100 Herbert Poetzl [EMAIL PROTECTED] wrote:
> > > On Fri, Dec 08, 2006 at 12:57:49PM -0700, Eric W. Biederman wrote:
> > > > Herbert Poetzl [EMAIL PROTECTED] writes:
> > > > > > But, ok, it is not the real point to argue so much imho and waste our time instead of doing things.
> > > > >
> > > > > well, IMHO better talk (and think) first, then implement something ... not the other way round, and then start fixing up the mess ...
> > > >
> > > > Well we need a bit of both.
> > >
> > > hmm, are 'we' in a hurry here?
> > >
> > > until recently, 'Linux' (mainline) didn't even want to hear about OS Level virtualization, now there is a rush to quickly get 'something' in, not knowing or caring if it is usable at all?
> >
> > It's actually happening quite gradually and carefully.
>
> hmm, I must have missed a testing phase for the IPC namespace then, not that I think it is broken (well, maybe it is, we do not know yet)

Herbert, you know that this code is used in our product. And in its turn, our product is tested internally and by a community. We have no reports about bugs in this code. If you have more to say than just something to say, please say it.

> I think there are a lot of 'potential users' for this kind of virtualization, and so 'we' can test almost all aspects outside of mainline, and once we know the stuff works as expected, then we can integrate it ...
>
> the UTS namespace was something 'we all' had already implemented in this (or a very similar) way, and in one or two iterations, it should actually work as expected. nevertheless, it was one of the simplest spaces ...
>
> we do not yet know the details for the IPC namespace, as IPC is not that easy to check as UTS, and 'we' haven't gotten real world feedback on that yet ...
>
> > We are very dependent upon all stakeholders including yourself to review, test and comment upon this infrastructure as it is proposed and merged. If something is proposed which will not suit your requirements then it is important that we hear about it, in detail, at the earliest possible time.
>
> okay, good to hear that I'm still considered a stakeholder
>
> will try to focus the feedback and cc as many folks as possible, as it seems that some feedback is lost on the way upstream ...
>
> best,
> Herbert

Thanks.

--
Thanks,
Dmitry.
Re: Network virtualization/isolation
On Sunday 03 December 2006 19:00, Eric W. Biederman wrote:
> Ok. Just a quick summary of where I see the discussion.
>
> We all agree that L2 isolation is needed at some point.

As we all agreed on this, maybe it is time to send patches one-by-one? For the beginning, I propose to resend Cedric's empty namespace patch as a base for the others - it is really empty, but necessary in order to move further.

After this patch and the following net namespace unshare patch are accepted, I could send the network devices virtualization patches for review and discussion.

What do you think?

> The approaches discussed for L2 and L3 are sufficiently orthogonal that we can implement them in either order. You would need to unshare L3 to unshare L2, but if we think of them as two separate namespaces we are likely to be in better shape.
>
> The L3 discussion still has the problem that there has not been agreement on all of the semantics yet.
>
> More comments after I get some sleep.
>
> Eric

--
Thanks,
Dmitry.
Re: Network virtualization/isolation
On Monday 04 December 2006 18:35, Eric W. Biederman wrote:
[skip]
> Where and when you look to find the network namespace that applies to a packet is the primary difference between the OpenVZ L2 implementation and my L2 implementation.
>
> If there is a better and less intrusive while still being obvious method I am all for it. I do not like the OpenVZ thing of doing the lookup once and then stashing the value in current and the special casing the exceptions.

Why?

--
Thanks,
Dmitry.
Re: Network virtualization/isolation
On Monday 04 December 2006 19:43, Herbert Poetzl wrote: On Mon, Dec 04, 2006 at 06:19:00PM +0300, Dmitry Mishin wrote: On Sunday 03 December 2006 19:00, Eric W. Biederman wrote: Ok. Just a quick summary of where I see the discussion. We all agree that L2 isolation is needed at some point. As we all agree on this, maybe it is time to send patches one by one? To begin with, I propose to resend Cedric's empty namespace patch as a base for the others - it is really empty, but necessary in order to move further. Once this patch and the following net namespace unshare patch are accepted, well, I have not yet seen any performance tests showing that the following is true: - no change in network performance without the space enabled - no change in network performance on the host with the network namespaces enabled - no measurable overhead inside the network namespace - good scalability for a larger number of network namespaces These questions are for the complete L2 implementation, not for these 2 empty patches. If you need some data relating to Andrey's implementation, I'll get it. Which tests would you accept? I could send the network device virtualization patches for review and discussion. that won't hurt ... best, Herbert What do you think? The approaches discussed for L2 and L3 are sufficiently orthogonal that we can implement them in either order. You would need to unshare L3 to unshare L2, but if we think of them as two separate namespaces we are likely to be in better shape. The L3 discussion still has the problem that there has not been agreement on all of the semantics yet. More comments after I get some sleep. Eric -- Thanks, Dmitry. ___ Containers mailing list [EMAIL PROTECTED] https://lists.osdl.org/mailman/listinfo/containers -- Thanks, Dmitry.
[PATCH] add ndisc_netdev_notifier unregister
If inet6_init() fails after the ndisc_init() call, or the IPv6 module is unloaded, ndisc_netdev_notifier remains registered in the notifier list and will cause an oops later.

Signed-off-by: Dmitry Mishin [EMAIL PROTECTED]
---
 ndisc.c |    1 +
 1 file changed, 1 insertion(+)
---
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 41a8a5f..73eb8c3 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1742,6 +1742,7 @@ #endif
 
 void ndisc_cleanup(void)
 {
+	unregister_netdevice_notifier(&ndisc_netdev_notifier);
 #ifdef CONFIG_SYSCTL
 	neigh_sysctl_unregister(&nd_tbl.parms);
 #endif
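Why a missing unregister is dangerous can be shown with a small user-space model of a notifier chain. This is a simplified sketch, not the kernel's notifier implementation: the list handling and all names (notifier, chain_register, chain_unregister, chain_call, module_notifier) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space model of a notifier chain (hypothetical names). */
struct notifier {
    int (*call)(struct notifier *self, unsigned long event);
    struct notifier *next;
};

static struct notifier *chain;   /* head of the singly linked chain */

/* Link a notifier at the head of the chain. */
static void chain_register(struct notifier *n)
{
    assert(n != NULL);
    n->next = chain;
    chain = n;
}

/* Unlink a notifier; does nothing if it is not on the chain. */
static void chain_unregister(struct notifier *n)
{
    struct notifier **pp;

    for (pp = &chain; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == n) {
            *pp = n->next;
            n->next = NULL;
            return;
        }
    }
}

/* Deliver an event to every registered notifier; returns how many ran. */
static int chain_call(unsigned long event)
{
    struct notifier *n;
    int count = 0;

    for (n = chain; n != NULL; n = n->next) {
        n->call(n, event);
        count++;
    }
    return count;
}

static int seen;   /* number of events the module's callback has handled */

static int count_event(struct notifier *self, unsigned long event)
{
    (void)self;
    (void)event;
    seen++;
    return 0;
}

/* In a real module this object lives in module memory. */
static struct notifier module_notifier = { count_event, NULL };

static void module_init_model(void)    { chain_register(&module_notifier); }
/* Cleanup must undo the registration made in init. */
static void module_cleanup_model(void) { chain_unregister(&module_notifier); }
```

In this model, if module_cleanup_model() skipped the unregister and the memory holding module_notifier were then released, the next chain_call() would follow a dangling pointer. That is the situation the one-line patch above avoids.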
Bridge MAC address question
Hi, Could somebody explain why the bridge uses the lowest MAC address of the attached devices? This makes the bridge address unstable: it can change during the bridge life cycle, which is not good for DHCP. For example, I want to attach multiple virtual devices to one physical device. Then I need to make sure that after each virtual device is added, the bridge address is not changed and is still the address of the physical device. Why not use the MAC of the first attached device? -- Thanks, Dmitry.
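The effect described in the question can be reproduced with a short C sketch of a "lowest MAC wins" rule. It only illustrates the selection policy as stated in the mail and is not the bridge code itself; lowest_mac_index and MAC_LEN are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAC_LEN 6

/* Return the index of the numerically lowest MAC among n ports,
 * or -1 if there are no ports. Comparing the bytes in order with
 * memcmp() is the same as comparing the addresses as big-endian numbers. */
static int lowest_mac_index(const unsigned char macs[][MAC_LEN], size_t n)
{
    size_t i;
    int best = -1;

    assert(macs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (best < 0 || memcmp(macs[i], macs[best], MAC_LEN) < 0)
            best = (int)i;
    }
    return best;
}
```

With a physical port 00:16:3e:00:00:01 attached first, the result is index 0. Attaching a virtual port with a lower address, such as 00:00:5e:00:00:09, moves the result to the new port, so the bridge address changes underneath a DHCP client. A "first attached device" policy would always return index 0.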
Re: Network virtualization/isolation
On Thursday 26 October 2006 19:56, Stephen Hemminger wrote: On Thu, 26 Oct 2006 11:44:55 +0200 Daniel Lezcano [EMAIL PROTECTED] wrote: Stephen Hemminger wrote: On Wed, 25 Oct 2006 17:51:28 +0200 Daniel Lezcano [EMAIL PROTECTED] wrote: Hi Stephen, currently the work to bring container enablement into the kernel is making good progress. The ipc, pid, utsname and filesystem system resources are isolated/virtualized relying on the namespaces concept. But network virtualization/isolation is still missing. Two approaches are proposed: doing the isolation at layer 2 and at layer 3. The first one instantiates a network device per namespace and adds a peer network device to the root namespace; all the routing resources are relative to the namespace. This work is done by Andrey Savochkin from the OpenVZ project. The second relies on the routes and associates the network namespace pointer with each route. When the traffic is incoming, the packet follows an input route and retrieves the associated network namespace. When the traffic is outgoing, the packet, identified by the network namespace it is coming from, follows only the routes matching the same network namespace. This work is done by me. IMHO, we need both approaches: layer 2 to be able to bring *very* strong isolation for system containers, at a performance cost, and layer 3 to be able to have good isolation for lightweight containers or application containers when performance is more important. Do you have some suggestions? What is your point of view on that? Thanks in advance. -- Daniel Any solution should allow both and it should build on the existing netfilter infrastructure. The problem is that netfilter cannot give good isolation, e.g. how can the netstat command be handled? Or how do we avoid seeing IP addresses assigned to another container when running ifconfig?
Furthermore, one of the biggest benefits of network isolation is to bring mobility to a container, and that can only be done if the network resources inside the kernel can be identified per container in order to checkpoint/restart them. The all-in-namespace solution, i.e. at layer 2, is very good in terms of isolation but it adds a non-negligible overhead. The layer 3 isolation has an insignificant overhead and good isolation, perfectly adapted for application containers. Unfortunately, from the point of view of implementation, layer 3 cannot be a subset of layer 2 isolation when using all-in-namespace, and layer 2 isolation cannot be an extension of the layer 3 isolation. I think the layer 2 and the layer 3 implementations can coexist. You can, for example, create a system container with layer 2 isolation and inside it add layer 3 isolation. Does that make sense? -- Daniel Assuming you are talking about pseudo-virtualized environments, there are several different discussions. 1. How should the namespace be isolated for the virtualized containered applications? 2. How should traffic be restricted into/out of those containers? This is where existing netfilter, classification, etc., should be used. The network code is overly rich as it is, we don't need another abstraction. 3. Can the virtualized containers be secure? No, we really can't keep a hostile root in a container from killing the system without going to a hypervisor. Stephen, a virtualized container can be secure if it is complete system virtualization, not just an application container. OpenVZ implements such virtualization and it is used heavily all over the world. And of course, we care a lot about keeping a hostile root from killing the whole system. OpenVZ uses virtualization on the IP level (implemented by Andrey Savochkin, http://marc.theaimsgroup.com/?l=linux-netdev&m=115572448503723), with all necessary network objects isolated/virtualized, such as sockets, devices, routes, netfilters, etc. -- Thanks, Dmitry.
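The layer 3 scheme quoted in this message (a namespace pointer attached to each route) can be sketched roughly as follows. This is a toy model written for illustration, not Daniel's implementation: the linear table, the field names and the functions input_ns and output_route are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct net_ns { int id; };

/* Hypothetical route entry carrying a namespace pointer. */
struct route {
    uint32_t prefix;     /* network address, host byte order */
    uint32_t mask;       /* contiguous netmask */
    struct net_ns *ns;   /* namespace that owns this route */
};

/* Input path: pick the most specific route matching the destination
 * and take the namespace from it. */
static struct net_ns *input_ns(const struct route *tbl, size_t n, uint32_t daddr)
{
    const struct route *best = NULL;
    size_t i;

    assert(tbl != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if ((daddr & tbl[i].mask) != tbl[i].prefix)
            continue;
        if (best == NULL || tbl[i].mask > best->mask)
            best = &tbl[i];
    }
    return best ? best->ns : NULL;
}

/* Output path: only routes tagged with the sender's namespace are eligible. */
static const struct route *output_route(const struct route *tbl, size_t n,
                                        const struct net_ns *ns, uint32_t daddr)
{
    const struct route *best = NULL;
    size_t i;

    assert(tbl != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (tbl[i].ns != ns || (daddr & tbl[i].mask) != tbl[i].prefix)
            continue;
        if (best == NULL || tbl[i].mask > best->mask)
            best = &tbl[i];
    }
    return best;
}
```

In this model a namespace that owns no matching route simply has no way out, which gives isolation without a separate device per namespace. By contrast, the layer 2 approach gives each namespace its own device and its own routing table rather than sharing one tagged table.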
Re: [RFC] network namespaces
Sorry, I didn't understand your proposal correctly from the previous discussion. :) But... On Tuesday 12 September 2006 07:28, Eric W. Biederman wrote: Do you have some concrete arguments against the proposal? Yes, I have. I think it is an unnecessary complication. This complication will result in additional bugs, especially if we accept rule creation in userspace. Why do we need a complex solution if there are only two approaches to socket binding - isolation and virtualization? These approaches could co-exist without hooks. Or do you perhaps have other ways in mind? -- Thanks, Dmitry.
Re: [Devel] Re: [RFC] network namespaces
On Monday 11 September 2006 18:57, Herbert Poetzl wrote: I completely agree here, we need a separate namespace for that, so that we can combine isolation and virtualization as needed, unless the bind restrictions can be completely expressed with an additional mangle or filter table (as was suggested) iptables is designed for packet flow decisions and filtering; it has nothing in common with bind restrictions. So it can only do packet flow scheduling/filtering, and it will not help to resolve bind-time IP conflicts. -- Thanks, Dmitry.
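The distinction drawn here is that a bind restriction is decided once, when a socket asks for a local address, whereas a filter table is consulted per packet. A minimal sketch of such a bind-time check, assuming a hypothetical per-container address set (ip_set and bind_allowed are invented names, and wildcard binds are deliberately left out):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical set of IPv4 addresses assigned to one container. */
struct ip_set {
    const uint32_t *addrs;   /* addresses in host byte order */
    size_t count;
};

/* Bind-time check: returns 1 if the container may bind to addr, else 0.
 * Wildcard (INADDR_ANY) handling is intentionally omitted from this sketch. */
static int bind_allowed(const struct ip_set *set, uint32_t addr)
{
    size_t i;

    assert(set != NULL);
    for (i = 0; i < set->count; i++) {
        if (set->addrs[i] == addr)
            return 1;
    }
    return 0;
}
```

A check like this rejects the conflicting bind() up front. A packet filter only sees traffic after the socket already exists, so it can drop packets but cannot tell the second container that the address is not its own.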
Re: [Devel] Re: [RFC] network namespaces
On Sunday 10 September 2006 06:47, Herbert Poetzl wrote: well, I think it would be best to have both, as they are complementary to some degree, and IMHO both, the full virtualization _and_ the isolation will require a separate namespace to work, [snip] I do not think that folks would want to recompile their kernel just to get a light-weight guest or a fully virtualized one In this case the light-weight guest will have unnecessary overhead. For example, instead of using a static pointer, we first have to find the required common namespace. And such a guest will have no advantages over a full-featured one. best, Herbert -- Thanks, Dmitry.
Re: [Devel] Re: [RFC] network namespaces
On Sunday 10 September 2006 07:41, Eric W. Biederman wrote: I certainly agree that we are not at a point where a final decision can be made. A major piece of that is that a layer 2 approach has not been shown to be without a performance penalty. But it is required. Why limit the possible usages? A practical question. Do the IPs assigned to guests ever get used by anything besides the guest? In the case of layer 2 virtualization - no. -- Thanks, Dmitry.
Re: [Devel] Re: [RFC] network namespaces
On Friday 08 September 2006 22:11, Herbert Poetzl wrote: actually the light-weight ip isolation runs perfectly fine _without_ CAP_NET_ADMIN, as you do not want the guest to be able to mess with the 'configured' ips at all (not to speak of interfaces here) It was only an example. I'm thinking about how to implement a flexible solution which permits light-weight IP isolation as well as full-fledged network virtualization. Another solution is to split CONFIG_NET_NAMESPACE. Would that work for you? -- Thanks, Dmitry.
Re: [Devel] Re: [RFC] network namespaces
On Thursday 07 September 2006 21:27, Herbert Poetzl wrote: well, who said that you need to have things like RAW sockets or other protocols except IP, not to speak of iptable and routing entries ... folks who _want_ full network virtualization can use the more complete virtual setup and be happy ... Let's think about how to implement this. As I understand VServer's design, your proposal is to split CAP_NET_ADMIN into multiple capabilities and use them as required. So, for your light-weight container it is enough to implement context isolation for the code protected by a CAP_NET_IP capability (for example) and to put 'if (!capable(CAP_NET_*))' checks in all other places. But this could easily be implemented on top of the OpenVZ code by splitting CAP_VE_NET_ADMIN. So, the question is: could you point out the places in Andrey's implementation of network namespaces which prevent you from adding CAP_NET_ADMIN separation later? -- Thanks, Dmitry.
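The capability split discussed above could look roughly like this in a user-space model. None of these names are real kernel capabilities: SK_CAP_NET_IP mirrors the "CAP_NET_IP (for example)" from the mail, and the remaining bits, the context structure and the sk_* functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fine-grained pieces of what CAP_NET_ADMIN covers today. */
enum {
    SK_CAP_NET_IP     = 1 << 0,   /* use the configured IP addresses */
    SK_CAP_NET_ROUTE  = 1 << 1,   /* change routing entries */
    SK_CAP_NET_FILTER = 1 << 2    /* change packet filter rules */
};

#define SK_EPERM (-1)   /* stand-in for -EPERM */

/* Per-container context holding the granted capability bits. */
struct ctx { unsigned int caps; };

/* Model of a capable() check against the container's capability mask. */
static int ctx_capable(const struct ctx *c, unsigned int cap)
{
    assert(c != NULL);
    return (c->caps & cap) == cap;
}

/* Each operation is guarded by the narrowest capability it needs. */
static int sk_bind_ip(const struct ctx *c)
{
    return ctx_capable(c, SK_CAP_NET_IP) ? 0 : SK_EPERM;
}

static int sk_add_route(const struct ctx *c)
{
    return ctx_capable(c, SK_CAP_NET_ROUTE) ? 0 : SK_EPERM;
}

static int sk_add_filter_rule(const struct ctx *c)
{
    return ctx_capable(c, SK_CAP_NET_FILTER) ? 0 : SK_EPERM;
}
```

A light-weight guest would be granted only SK_CAP_NET_IP, while a fully virtualized one would get every bit. The guarded code paths are the same in both cases, which is the point of the question: the split can be layered on top of a full implementation later.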