[PATCH] IPv6 - Add missing initializations of the new nl_info.nl_net field
Here is an updated version of the patch without the initializations to zero.

Add some more missing initializations of the new nl_info.nl_net field in
IPv6 stack. This field will be used when network namespaces are fully
supported.

Signed-off-by: Benjamin Thery [EMAIL PROTECTED]
---
 net/ipv6/addrconf.c |    3 +++
 net/ipv6/route.c    |    2 ++
 2 files changed, 5 insertions(+)

Index: net-2.6.26/net/ipv6/addrconf.c
===================================================================
--- net-2.6.26.orig/net/ipv6/addrconf.c
+++ net-2.6.26/net/ipv6/addrconf.c
@@ -1557,6 +1557,7 @@ addrconf_prefix_route(struct in6_addr *p
 		.fc_expires = expires,
 		.fc_dst_len = plen,
 		.fc_flags = RTF_UP | flags,
+		.fc_nlinfo.nl_net = &init_net,
 	};
 
 	ipv6_addr_copy(&cfg.fc_dst, pfx);
@@ -1583,6 +1584,7 @@ static void addrconf_add_mroute(struct n
 		.fc_ifindex = dev->ifindex,
 		.fc_dst_len = 8,
 		.fc_flags = RTF_UP,
+		.fc_nlinfo.nl_net = &init_net,
 	};
 
 	ipv6_addr_set(&cfg.fc_dst, htonl(0xFF000000), 0, 0, 0);
@@ -1599,6 +1601,7 @@ static void sit_route_add(struct net_dev
 		.fc_ifindex = dev->ifindex,
 		.fc_dst_len = 96,
 		.fc_flags = RTF_UP | RTF_NONEXTHOP,
+		.fc_nlinfo.nl_net = &init_net,
 	};
 
 	/* prefix length - 96 bits "::d.d.d.d" */
Index: net-2.6.26/net/ipv6/route.c
===================================================================
--- net-2.6.26.orig/net/ipv6/route.c
+++ net-2.6.26/net/ipv6/route.c
@@ -1719,6 +1719,8 @@ static void rtmsg_to_fib6_config(struct
 	cfg->fc_src_len = rtmsg->rtmsg_src_len;
 	cfg->fc_flags = rtmsg->rtmsg_flags;
 
+	cfg->fc_nlinfo.nl_net = &init_net;
+
 	ipv6_addr_copy(&cfg->fc_dst, &rtmsg->rtmsg_dst);
 	ipv6_addr_copy(&cfg->fc_src, &rtmsg->rtmsg_src);
 	ipv6_addr_copy(&cfg->fc_gateway, &rtmsg->rtmsg_gateway);

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
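For readers wondering why the explicit zero initializations could be dropped: C guarantees that members left out of a designated initializer are zero-initialized. The following is a minimal userspace sketch of that rule only. The names `net_demo`, `nl_info_demo`, `init_net_demo` and `make_info()` are hypothetical stand-ins for illustration, not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct net. */
struct net_demo {
	int id;
};

/* Hypothetical stand-in for struct nl_info; field names mirror the patch. */
struct nl_info_demo {
	void *nlh;                 /* netlink header, NULL when there is none */
	struct net_demo *nl_net;   /* owning network namespace */
	unsigned int pid;
};

static struct net_demo init_net_demo = { .id = 1 };

/*
 * Only nl_net is named in the initializer. Every member that is not
 * named (nlh, pid) is implicitly initialized to zero / NULL by the
 * language, so spelling out ".pid = 0" and ".nlh = NULL" is redundant.
 */
static struct nl_info_demo make_info(void)
{
	struct nl_info_demo info = {
		.nl_net = &init_net_demo,
	};

	return info;
}
```

Calling `make_info()` returns a structure whose `pid` is 0 and whose `nlh` is NULL even though neither member is mentioned, which is the property the updated patch relies on.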
[PATCH] IPv6 Add more initializations of the new nl_info.nl_net field
Add more missing initializations of the new nl_info.nl_net field in
IPv6 stack. This field will be used when network namespaces are fully
supported.

Signed-off-by: Benjamin Thery [EMAIL PROTECTED]
---
 net/ipv6/addrconf.c |    9 +++++++++
 net/ipv6/route.c    |    6 ++++++
 2 files changed, 15 insertions(+)

Index: net-2.6.26/net/ipv6/addrconf.c
===================================================================
--- net-2.6.26.orig/net/ipv6/addrconf.c
+++ net-2.6.26/net/ipv6/addrconf.c
@@ -1557,6 +1557,9 @@ addrconf_prefix_route(struct in6_addr *p
 		.fc_expires = expires,
 		.fc_dst_len = plen,
 		.fc_flags = RTF_UP | flags,
+		.fc_nlinfo.pid = 0,
+		.fc_nlinfo.nlh = NULL,
+		.fc_nlinfo.nl_net = &init_net,
 	};
 
 	ipv6_addr_copy(&cfg.fc_dst, pfx);
@@ -1583,6 +1586,9 @@ static void addrconf_add_mroute(struct n
 		.fc_ifindex = dev->ifindex,
 		.fc_dst_len = 8,
 		.fc_flags = RTF_UP,
+		.fc_nlinfo.pid = 0,
+		.fc_nlinfo.nlh = NULL,
+		.fc_nlinfo.nl_net = &init_net,
 	};
 
 	ipv6_addr_set(&cfg.fc_dst, htonl(0xFF000000), 0, 0, 0);
@@ -1599,6 +1605,9 @@ static void sit_route_add(struct net_dev
 		.fc_ifindex = dev->ifindex,
 		.fc_dst_len = 96,
 		.fc_flags = RTF_UP | RTF_NONEXTHOP,
+		.fc_nlinfo.pid = 0,
+		.fc_nlinfo.nlh = NULL,
+		.fc_nlinfo.nl_net = &init_net,
 	};
 
 	/* prefix length - 96 bits "::d.d.d.d" */
Index: net-2.6.26/net/ipv6/route.c
===================================================================
--- net-2.6.26.orig/net/ipv6/route.c
+++ net-2.6.26/net/ipv6/route.c
@@ -604,6 +604,8 @@ static int __ip6_ins_rt(struct rt6_info
 int ip6_ins_rt(struct rt6_info *rt)
 {
 	struct nl_info info = {
+		.pid = 0,
+		.nlh = NULL,
 		.nl_net = &init_net,
 	};
 	return __ip6_ins_rt(rt, &info);
@@ -1264,6 +1266,8 @@ static int __ip6_del_rt(struct rt6_info
 int ip6_del_rt(struct rt6_info *rt)
 {
 	struct nl_info info = {
+		.pid = 0,
+		.nlh = NULL,
 		.nl_net = &init_net,
 	};
 	return __ip6_del_rt(rt, &info);
@@ -1719,6 +1723,8 @@ static void rtmsg_to_fib6_config(struct
 	cfg->fc_src_len = rtmsg->rtmsg_src_len;
 	cfg->fc_flags = rtmsg->rtmsg_flags;
 
+	cfg->fc_nlinfo.nl_net = &init_net;
+
 	ipv6_addr_copy(&cfg->fc_dst, &rtmsg->rtmsg_dst);
 	ipv6_addr_copy(&cfg->fc_src, &rtmsg->rtmsg_src);
 	ipv6_addr_copy(&cfg->fc_gateway, &rtmsg->rtmsg_gateway);

--
Re: [2.6.25-rc2] System freezes ca. 1 minute after logging into KDE
On Feb 17, 2008 11:39 AM, Frans Pop [EMAIL PROTECTED] wrote:
> (resend a third time because previous attempts never reached the lists
> due to a bug in my MUA; my apologies to David for spamming his inbox)
>
> Linus Torvalds wrote:
> > But hey, you can try to prove me wrong. I dare you.
>
> Me too, me too!
>
> Weird issue this. About a minute after logging into KDE the system
> freezes, but only partially. The keyboard is completely dead in all
> cases (no console switching, no SysRq), but some tasks stay running.
> One time music continued playing, other times it stopped. One time the
> desktop clock continued ticking, other times it stopped. One time I
> could close a window using the mouse, but other windows were frozen.
>
> It's not just KDE that's frozen; one time I switched to VT1 before the
> freeze happened, but that became unusable too.
>
> Zilch in the logs.
>
> I've bisected it down to:
> commit 69cc64d8d92bf852f933e90c888dfff083bd4fc9
> Author: David S. Miller [EMAIL PROTECTED]
>
>     [NDISC]: Fix race in generic address resolution
>
> Confirmed that this is really the culprit by reverting this commit on
> top of -rc2, which is now running fine.
>
> I'm using IPv6 (local network only) together with IPv4, use a bridge
> (br0) and have an NFS4 mount active.

I encountered the same issue last Thursday.
Here, I can hang my machine with ping6.
I've also bisected it down to the same commit.

I've sent some kernel traces which show how the soft lockup occurs.
See thread: "[PATCH][RFC] race in generic address resolution"
http://www.spinics.net/lists/netdev/msg55375.html

Benjamin

> Cheers,
> FJP
Re: [BUG] IPv6 recursive locking
On Feb 17, 2008 7:30 PM, Daniel Lezcano [EMAIL PROTECTED] wrote:
> Kristof Provost wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Hi,
>>
>> I'm running the current git (1309d4e68497184d2fd87e892ddf14076c2bda98)
>> without problems. While I was toying with IPv6 on my local network I
>> managed to completely hang my machine whenever it receives or sends a
>> neighbour solicitation. At least, I think that's the cause. It started
>> as soon as I installed radvd on the router. The included trace seems
>> to point in the same direction.
>>
>> The machine is a Dell Latitude D505 (so x86). Network interfaces are
>> e100 and ipw2200 (firmware not loaded). I'm currently using the e100.
>>
>> I'll try to bisect it but here's the trace already. Let me know if
>> there's anything else you'd like to know.
>
> I think this bug was introduced by the commit:
> 69cc64d8d92bf852f933e90c888dfff083bd4fc9
> [NDISC]: Fix race in generic address resolution.

I confirm this commit is the culprit.

I reported the same bug last Thursday, but it seems I made a mistake:
I reported it by replying to the original thread which led to this
commit. As that thread was a bit old, it seems my answer wasn't noticed.

See http://www.spinics.net/lists/netdev/msg55373.html

The lockup happens very quickly when you have IPv6 configured.
I think we should revert this commit for now.

Benjamin

>> [ 124.439831] =============================================
>> [ 124.443689] [ INFO: possible recursive locking detected ]
>> [ 124.443689] 2.6.25-rc2 #33
>> [ 124.443689] ---------------------------------------------
>> [ 124.443689] swapper/0 is trying to acquire lock:
>> [ 124.443689]  (&n->lock){-+-+}, at: [c0468d39] neigh_resolve_output+0x139/0x290
>> [ 124.443689]
>> [ 124.443689] but task is already holding lock:
>> [ 124.443689]  (&n->lock){-+-+}, at: [c0468ea4] neigh_timer_handler+0x14/0x280
>> [ 124.443689]
>> [ 124.443689] other info that might help us debug this:
>> [ 124.443689] 1 lock held by swapper/0:
>> [ 124.443689]  #0:  (&n->lock){-+-+}, at: [c0468ea4] neigh_timer_handler+0x14/0x280
>> [ 124.443689]
>> [ 124.443689] stack backtrace:
>> [ 124.443689] Pid: 0, comm: swapper Not tainted 2.6.25-rc2 #33
>> [ 124.443689]  [c014863a] __lock_acquire+0xd3a/0xf40
>> [ 124.443689]  [c0137ec8] __kernel_text_address+0x18/0x30
>> [ 124.443689]  [c01488a0] lock_acquire+0x60/0x80
>> [ 124.443689]  [c0468d39] neigh_resolve_output+0x139/0x290
>> [ 124.443689]  [c059287e] _write_lock_bh+0x2e/0x40
>> [ 124.443689]  [c0468d39] neigh_resolve_output+0x139/0x290
>> [ 124.443689]  [c0468d39] neigh_resolve_output+0x139/0x290
>> [ 124.443689]  [c0148805] __lock_acquire+0xf05/0xf40
>> [ 124.443689]  [c04e1650] ndisc_dst_alloc+0xe0/0x170
>> [ 124.443689]  [c04d39f4] ip6_output_finish+0xa4/0x110
>> [ 124.443689]  [c0147a1d] __lock_acquire+0x11d/0xf40
>> [ 124.443689]  [c04d4759] ip6_output+0x5b9/0xba0
>> [ 124.443689]  [c0456eb6] sock_alloc_send_skb+0x176/0x1d0
>> [ 124.443689]  [c04e4eab] __ndisc_send+0x33b/0x540
>> [ 124.443690]  [c04e4d6e] __ndisc_send+0x1fe/0x540
>> [ 124.443690]  [c04e5b69] ndisc_send_ns+0x69/0xa0
>> [ 124.443690]  [c04e6c8e] ndisc_solicit+0xee/0x1b0
>> [ 124.443690]  [c01472b5] mark_held_locks+0x35/0x80
>> [ 124.443690]  [c0592c65] _spin_unlock_irqrestore+0x45/0x60
>> [ 124.443690]  [c01473f9] trace_hardirqs_on+0x79/0x130
>> [ 124.443690]  [c012f99f] __mod_timer+0x9f/0xb0
>> [ 124.443690]  [c0468fd3] neigh_timer_handler+0x143/0x280
>> [ 124.443690]  [c012f2ca] run_timer_softirq+0x14a/0x1c0
>> [ 124.443690]  [c0468e90] neigh_timer_handler+0x0/0x280
>> [ 124.443690]  [c0468e90] neigh_timer_handler+0x0/0x280
>> [ 124.443690]  [c012b4c4] __do_softirq+0x84/0x100
>> [ 124.443690]  [c012b595] do_softirq+0x55/0x60
>> [ 124.443690]  [c012b9e5] irq_exit+0x65/0x80
>> [ 124.443690]  [c01073b0] do_IRQ+0x40/0x70
>> [ 124.443690]  [c010585e] common_interrupt+0x2e/0x34
>> [ 124.443690]  [c032007b] acpi_power_on+0x3b/0x104
>> [ 124.443690]  [c0322af6] acpi_idle_enter_simple+0x194/0x1fe
>> [ 124.443690]  [c0322727] acpi_idle_enter_bm+0xc1/0x2fc
>> [ 124.443690]  [c03fff43] cpuidle_idle_call+0x63/0xb0
>> [ 124.443690]  [c03ffee0] cpuidle_idle_call+0x0/0xb0
>> [ 124.443690]  [c010380d] cpu_idle+0x5d/0xf0
>> [ 124.443690]  =======================
>>
>> Kristof
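As the lockdep report reads, the timer handler holds n->lock and the solicit path it calls ends up in neigh_resolve_output(), which tries to take the same lock again. The sketch below models only that call pattern in userspace, with a boolean standing in for the non-recursive write lock. All `demo_*` names are hypothetical, and dropping the lock before soliciting is shown merely as one conceivable way to break the cycle, not as the fix that was applied upstream.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a neighbour entry. */
struct neigh_demo {
	bool lock_held;   /* models n->lock, a non-recursive write lock */
};

/*
 * Try to take the lock. A real write_lock_bh() on an already-held lock
 * would spin forever; here we return false so the condition is visible.
 */
static bool demo_try_write_lock(struct neigh_demo *n)
{
	if (n->lock_held)
		return false;
	n->lock_held = true;
	return true;
}

static void demo_write_unlock(struct neigh_demo *n)
{
	n->lock_held = false;
}

/* Models the output path, which takes the neighbour lock itself. */
static bool demo_resolve_output(struct neigh_demo *n)
{
	if (!demo_try_write_lock(n))
		return false;   /* the kernel would deadlock at this point */
	demo_write_unlock(n);
	return true;
}

/*
 * Models the timer handler. When unlock_before_solicit is false, the
 * lock is still held while the output path runs, reproducing the
 * recursive acquisition from the trace.
 */
static bool demo_timer_handler(struct neigh_demo *n, bool unlock_before_solicit)
{
	bool sent;

	if (!demo_try_write_lock(n))
		return false;
	if (unlock_before_solicit)
		demo_write_unlock(n);

	sent = demo_resolve_output(n);

	if (!unlock_before_solicit)
		demo_write_unlock(n);
	return sent;
}
```

With `unlock_before_solicit` set to false the nested acquisition fails, which corresponds to the hang; with it set to true the output path can take the lock normally.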
Re: [PATCH][RFC] race in generic address resolution
Hi,

It seems this patch hangs my machine very quickly when there is some
ICMPv6 traffic. I'm using net-2.6, pulled today (14th Feb).

I had some unexpected hangs on my SMP test machines and I bisected
the problem to 69cc64d8d92bf852f933e90c888dfff083bd4fc9
"[NDISC]: Fix race in generic address resolution".

Looks like a deadlock:
BUG: soft lockup - CPU#1 stuck for 61s! [swapper:0]

Here are some traces printed on the console:

Pid: 0, comm: swapper Not tainted (2.6.25-rc1-netns-00113-g69cc64d-dirty #34)
EIP: 0060:[c02eb5f6] EFLAGS: 0287 CPU: 0
EIP is at __write_lock_failed+0xa/0x20
EAX: c7b3fab4 EBX: c7b3fab4 ECX: EDX: c0377986
ESI: c7b3fa90 EDI: c7b6f290 EBP: c03cbd24 ESP: c03cbd24
 DS: 007b ES: 007b FS: 00d8 GS: SS: 0068
CR0: 8005003b CR2: b7f9b404 CR3: 07ac8000 CR4: 0690
DR0: DR1: DR2: DR3:
DR6: DR7:
 [c020e43f] _raw_write_lock+0x57/0x6c
 [c02eba95] _write_lock_bh+0x25/0x2d
 [c026b107] ? neigh_resolve_output+0x93/0x238
 [c026b107] neigh_resolve_output+0x93/0x238
 [c02a5635] ip6_output2+0x241/0x289
 [c02a61cd] ip6_output+0xa92/0xaad
 [c025ff11] ? __alloc_skb+0x4f/0xfb
 [c02b2596] ? __ndisc_send+0x1fb/0x3f5
 [c02b26a0] __ndisc_send+0x305/0x3f5
 [c02b2fb5] ndisc_send_ns+0x63/0x6e
 [c02b3f3e] ndisc_solicit+0x183/0x18d
 [c0121071] ? __mod_timer+0x96/0xa1
 [c026b81e] neigh_timer_handler+0x214/0x252
 [c0120c90] run_timer_softirq+0xfe/0x159
 [c026b60a] ? neigh_timer_handler+0x0/0x252
 [c011dbfa] __do_softirq+0x6f/0xe9
 [c011dcae] do_softirq+0x3a/0x52
 [c011dfc3] irq_exit+0x44/0x46
 [c0105273] do_IRQ+0x5a/0x73
 [c0103666] common_interrupt+0x2e/0x34
 [c0101954] ? default_idle+0x4a/0x77
 [c010190a] ? default_idle+0x0/0x77
 [c0101855] cpu_idle+0x89/0x9d
 [c02e6135] rest_init+0x49/0x4b
 =======================
BUG: soft lockup - CPU#1 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainted (2.6.25-rc1-netns-00113-g69cc64d-dirty #34)
EIP: 0060:[c02eb5f6] EFLAGS: 0287 CPU: 1
EIP is at __write_lock_failed+0xa/0x20
EAX: c7b3fab4 EBX: c7b3fab4 ECX: EDX:
ESI: c03bb9c0 EDI: c7b3fab4 EBP: c7841eb0 ESP: c7841eb0
 DS: 007b ES: 007b FS: 00d8 GS: SS: 0068
CR0: 8005003b CR2: 08560008 CR3: 07b04000 CR4: 0690
DR0: DR1: DR2: DR3:
DR6: DR7:
 [c020e43f] _raw_write_lock+0x57/0x6c
 [c02eba68] _write_lock+0x20/0x28
 [c026982c] ? neigh_periodic_timer+0x99/0x142
 [c026982c] neigh_periodic_timer+0x99/0x142
 [c0120c90] run_timer_softirq+0xfe/0x159
 [c0269793] ? neigh_periodic_timer+0x0/0x142
 [c011dbfa] __do_softirq+0x6f/0xe9
 [c011dcae] do_softirq+0x3a/0x52
 [c011dfc3] irq_exit+0x44/0x46
 [c010d680] smp_apic_timer_interrupt+0x71/0x81
 [c0103747] apic_timer_interrupt+0x33/0x38
 [c0101954] ? default_idle+0x4a/0x77
 [c010190a] ? default_idle+0x0/0x77
 [c0101855] cpu_idle+0x89/0x9d
 =======================
BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainted (2.6.25-rc1-netns-00113-g69cc64d-dirty #34)
EIP: 0060:[c02eb5f6] EFLAGS: 0287 CPU: 0
EIP is at __write_lock_failed+0xa/0x20
EAX: c7b3fab4 EBX: c7b3fab4 ECX: EDX: c0377986
ESI: c7b3fa90 EDI: c7b6f290 EBP: c03cbd24 ESP: c03cbd24
 DS: 007b ES: 007b FS: 00d8 GS: SS: 0068
CR0: 8005003b CR2: b7f9b404 CR3: 07ac8000 CR4: 0690
DR0: DR1: DR2: DR3:
DR6: DR7:
 [c020e43f] _raw_write_lock+0x57/0x6c
 [c02eba95] _write_lock_bh+0x25/0x2d
 [c026b107] ? neigh_resolve_output+0x93/0x238
 [c026b107] neigh_resolve_output+0x93/0x238
 [c02a5635] ip6_output2+0x241/0x289
 [c02a61cd] ip6_output+0xa92/0xaad
 [c025ff11] ? __alloc_skb+0x4f/0xfb
 [c02b2596] ? __ndisc_send+0x1fb/0x3f5
 [c02b26a0] __ndisc_send+0x305/0x3f5
 [c02b2fb5] ndisc_send_ns+0x63/0x6e
 [c02b3f3e] ndisc_solicit+0x183/0x18d
 [c0121071] ? __mod_timer+0x96/0xa1
 [c026b81e] neigh_timer_handler+0x214/0x252
 [c0120c90] run_timer_softirq+0xfe/0x159
 [c026b60a] ? neigh_timer_handler+0x0/0x252
 [c011dbfa] __do_softirq+0x6f/0xe9
 [c011dcae] do_softirq+0x3a/0x52
 [c011dfc3] irq_exit+0x44/0x46
 [c0105273] do_IRQ+0x5a/0x73
 [c0103666] common_interrupt+0x2e/0x34
 [c0101954] ? default_idle+0x4a/0x77
 [c010190a] ? default_idle+0x0/0x77
 [c0101855] cpu_idle+0x89/0x9d
 [c02e6135] rest_init+0x49/0x4b
 =======================
BUG: soft lockup - CPU#1 stuck for 61s! [swapper:0]
...

Benjamin

On Tue, Feb 12, 2008 at 6:47 AM, David Miller [EMAIL PROTECTED] wrote:
> From: Frank Blaschka [EMAIL PROTECTED]
> Date: Mon, 11 Feb 2008 10:01:20 +0100
>
> > we run your patch during the weekend on single CPU and SMP machines.
> > We do not see any problems. Thanks for providing the fix.
>
> Thanks for testing Frank, I can now push this fix upstream.
Re: [PATCH][RFC] race in generic address resolution
I ran some additional tests and these traces may also be useful.
They appear before the soft lockups are detected.

fermi:~# ping6 -c 500 -f 2007::1
PING 2007::1(2007::1) 56 data bytes
.
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.25-rc1-00113-g69cc64d-dirty #34
-------------------------------------------------------
ping6/1058 is trying to acquire lock:
 (&tbl->lock){-+-+}, at: [c02691ac] neigh_lookup+0x43/0xa2

but task is already holding lock:
 (&n->lock){-+..}, at: [c026b620] neigh_timer_handler+0x16/0x252

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&n->lock){-+..}:
       [c01330b8] __lock_acquire+0x947/0xafc
       [c026982c] neigh_periodic_timer+0x99/0x142
       [c01332d0] lock_acquire+0x63/0x80
       [c026982c] neigh_periodic_timer+0x99/0x142
       [c02eba61] _write_lock+0x19/0x28
       [c026982c] neigh_periodic_timer+0x99/0x142
       [c026982c] neigh_periodic_timer+0x99/0x142
       [c0120c90] run_timer_softirq+0xfe/0x159
       [c0269793] neigh_periodic_timer+0x0/0x142
       [c011dbfa] __do_softirq+0x6f/0xe9
       [c011dcae] do_softirq+0x3a/0x52
       [c011dfc3] irq_exit+0x44/0x46
       [c010d680] smp_apic_timer_interrupt+0x71/0x81
       [c0103747] apic_timer_interrupt+0x33/0x38
       [c014e0ce] mmap_region+0xe1/0x376
       [c014e680] arch_get_unmapped_area_topdown+0x0/0x12e
       [c014e625] do_mmap_pgoff+0x1e2/0x23d
       [c0181895] elf_map+0xd8/0x104
       [c0182072] load_elf_binary+0x5b4/0x11cd
       [c015ed73] search_binary_handler+0x74/0x164
       [c0181abe] load_elf_binary+0x0/0x11cd
       [c015ed7a] search_binary_handler+0x7b/0x164
       [c015ff2e] do_execve+0x121/0x16a
       [c01012e3] sys_execve+0x29/0x52
       [c0102c56] syscall_call+0x7/0xb
       [] 0x

-> #0 (&tbl->lock){-+-+}:
       [c0132fdf] __lock_acquire+0x86e/0xafc
       [c01332d0] lock_acquire+0x63/0x80
       [c02691ac] neigh_lookup+0x43/0xa2
       [c02ebae9] _read_lock_bh+0x1e/0x2d
       [c02691ac] neigh_lookup+0x43/0xa2
       [c02691ac] neigh_lookup+0x43/0xa2
       [c02af858] ndisc_dst_alloc+0xb5/0x155
       [c02b240d] __ndisc_send+0x72/0x3f5
       [c02a573b] ip6_output+0x0/0xaad
       [c0133225] __lock_acquire+0xab4/0xafc
       [c02b2fb5] ndisc_send_ns+0x63/0x6e
       [c02eb92c] _read_unlock_bh+0x25/0x28
       [c02b3f3e] ndisc_solicit+0x183/0x18d
       [c0121071] __mod_timer+0x96/0xa1
       [c026b81e] neigh_timer_handler+0x214/0x252
       [c0120c90] run_timer_softirq+0xfe/0x159
       [c026b60a] neigh_timer_handler+0x0/0x252
       [c011dbfa] __do_softirq+0x6f/0xe9
       [c011dcae] do_softirq+0x3a/0x52
       [c011dfc3] irq_exit+0x44/0x46
       [c0105273] do_IRQ+0x5a/0x73
       [c0103666] common_interrupt+0x2e/0x34
       [c02ebd3a] _spin_unlock_irqrestore+0x38/0x3c
       [c02188fb] tty_ldisc_deref+0x5c/0x63
       [c021a5bd] tty_write+0x1a8/0x1b9
       [c021c5e1] write_chan+0x0/0x2a9
       [c021a633] redirected_tty_write+0x65/0x72
       [c021a5ce] redirected_tty_write+0x0/0x72
       [c015be18] vfs_write+0x8c/0x108
       [c015c3a2] sys_write+0x3b/0x60
       [c0102c56] syscall_call+0x7/0xb
       [] 0x

other info that might help us debug this:

1 lock held by ping6/1058:
 #0:  (&n->lock){-+..}, at: [c026b620] neigh_timer_handler+0x16/0x252

stack backtrace:
Pid: 1058, comm: ping6 Not tainted 2.6.25-rc1-netns-00113-g69cc64d-dirty #34
 [c013176b] print_circular_bug_tail+0x5b/0x66
 [c0132fdf] __lock_acquire+0x86e/0xafc
 [c01332d0] lock_acquire+0x63/0x80
 [c02691ac] ? neigh_lookup+0x43/0xa2
 [c02ebae9] _read_lock_bh+0x1e/0x2d
 [c02691ac] ? neigh_lookup+0x43/0xa2
 [c02691ac] neigh_lookup+0x43/0xa2
 [c02af858] ndisc_dst_alloc+0xb5/0x155
 [c02b240d] __ndisc_send+0x72/0x3f5
 [c02a573b] ? ip6_output+0x0/0xaad
 [c0133225] ? __lock_acquire+0xab4/0xafc
 [c02b2fb5] ndisc_send_ns+0x63/0x6e
 [c02eb92c] ? _read_unlock_bh+0x25/0x28
 [c02b3f3e] ndisc_solicit+0x183/0x18d
 [c0121071] ? __mod_timer+0x96/0xa1
 [c026b81e] neigh_timer_handler+0x214/0x252
 [c0120c90] run_timer_softirq+0xfe/0x159
 [c026b60a] ? neigh_timer_handler+0x0/0x252
 [c011dbfa] __do_softirq+0x6f/0xe9
 [c011dcae] do_softirq+0x3a/0x52
 [c011dfc3] irq_exit+0x44/0x46
 [c0105273] do_IRQ+0x5a/0x73
 [c0103666] common_interrupt+0x2e/0x34
 [c02ebd3a] ? _spin_unlock_irqrestore+0x38/0x3c
 [c02188fb] tty_ldisc_deref+0x5c/0x63
 [c021a5bd] tty_write+0x1a8/0x1b9
 [c021c5e1] ? write_chan+0x0/0x2a9
 [c021a633] redirected_tty_write+0x65/0x72
 [c021a5ce] ? redirected_tty_write+0x0/0x72
 [c015be18] vfs_write+0x8c/0x108
 [c015c3a2] sys_write+0x3b/0x60
 [c0102c56] syscall_call+0x7/0xb
 =======================

On Thu, Feb 14, 2008 at 5:56 PM, Benjamin Thery [EMAIL PROTECTED] wrote:
> Hi,
>
> It seems this patch hangs my machine very quickly when there are some
> ICMPv6 traffic. I'm using net-2.6, pulled today (14th Feb).
>
> I had some unexpected hangs on my SMP test machines and I bisected
[PATCH 1/1][NETNS] Add missing initialization of nl_info.nl_net in rtm_to_fib6_config()
Add missing initialization of the new nl_info.nl_net field in
rtm_to_fib6_config(). It is needed to store the network namespace
associated with the fib6_config struct.

Signed-off-by: Benjamin Thery [EMAIL PROTECTED]
---
 net/ipv6/route.c |    1 +
 1 file changed, 1 insertion(+)

Index: net-2.6.25/net/ipv6/route.c
===================================================================
--- net-2.6.25.orig/net/ipv6/route.c
+++ net-2.6.25/net/ipv6/route.c
@@ -1955,6 +1955,7 @@ static int rtm_to_fib6_config(struct sk_
 
 	cfg->fc_nlinfo.pid = NETLINK_CB(skb).pid;
 	cfg->fc_nlinfo.nlh = nlh;
+	cfg->fc_nlinfo.nl_net = skb->sk->sk_net;
 
 	if (tb[RTA_GATEWAY]) {
 		nla_memcpy(&cfg->fc_gateway, tb[RTA_GATEWAY], 16);

--
Re: [PROCFS] [NETNS] issue with /proc/net entries
Eric W. Biederman wrote:
> Benjamin Thery [EMAIL PROTECTED] writes:
>
>> Hi Eric,
>>
>> While testing the current network namespace stuff merged in net-2.6.25,
>> I bumped into the following problem with the /proc/net/ entries.
>> It doesn't always display the actual data of the current namespace,
>> but sometimes displays data from other namespaces.
>>
>> I bisected the problem to the commit:
>> "proc: remove/Fix proc generic d_revalidate"
>> 3790ee4bd86396558eedd86faac1052cb782e4e1
>>
>> The problem:
>>
>> If a process in a particular network namespace changes current
>> directory to /proc/net, then processes in other network namespaces
>> trying to look at /proc/net entries will see data from the first
>> namespace (the one with CWD /proc/net). (See test case below).
>>
>> As your comments in the commit suggest, you seem to be aware of some
>> issues when CONFIG_NET_NS=y. Is it one of these corner cases you
>> identified?
>>
>> Any idea on how we can fix it?
>
> Yes. It isn't especially hard. I have most of it in my queue, I just
> need to get the silly patches out of there.
>
> Essentially we need to fix the caching of proc_generic entries, so
> that we can have a proper d_revalidate implementation.
>
> To get d_revalidate and the caching correct for /proc/net will take
> just a bit more work. We need to make /proc/net a symlink to something
> like /proc/self/net so that we don't get excess revalidates when
> switching between different processes. Otherwise we can't properly
> handle the case you have described, where being in the directory
> causes the wrong version of /proc/net to show up.
>
> Changing the contents of the dentry for /proc/net should only happen
> during unshare, not when we switch between processes, or else we get
> into the "d_revalidate leaks mount points" problem again.
>
> We also need the check to see if something is mounted on top of us
> before we drop the dentry. But if we don't even try until we know the
> dentry is invalid it should not be too bad.

Thanks for all the details.

I'll put this issue on my "netns current limitations" list until it's
solved.

Benjamin

> Eric

--
B e n j a m i n   T h e r y  - BULL/DT/Open Software R&D

   http://www.bull.com
[PROCFS] [NETNS] issue with /proc/net entries
Hi Eric,

While testing the current network namespace stuff merged in net-2.6.25,
I bumped into the following problem with the /proc/net/ entries.
It doesn't always display the actual data of the current namespace,
but sometimes displays data from other namespaces.

I bisected the problem to the commit:
"proc: remove/Fix proc generic d_revalidate"
3790ee4bd86396558eedd86faac1052cb782e4e1

The problem:

If a process in a particular network namespace changes current
directory to /proc/net, then processes in other network namespaces
trying to look at /proc/net entries will see data from the first
namespace (the one with CWD /proc/net). (See test case below).

As your comments in the commit suggest, you seem to be aware of some
issues when CONFIG_NET_NS=y. Is it one of these corner cases you
identified?

Any idea on how we can fix it?

Thanks.
Benjamin

Test case:
----------

(1) Shell 1, in init namespace:
    $ cat /proc/net/dev
    lo ...
    eth0 ...

(2) Shell 2, in another network namespace
    $ cat /proc/net/dev
    lo ...

(3) Shell 1
    $ cd /proc/net
    $ cat dev
    lo ...
    eth0 ...

(4) Shell 2
    $ cat /proc/net/dev
    lo ...
    eth0 ...

    Argh, lo + eth0 in the child namespace: the device list of the
    init netns is displayed in /proc/net/dev of the child namespace :-(

(5) Shell 1
    $ cd /

(6) Shell 2
    $ cat /proc/net/dev
    lo ...

    Back to normality.

--
B e n j a m i n   T h e r y  - BULL/DT/Open Software R&D

   http://www.bull.com
Re: [patch 5/9][NETNS][IPV6] make bindv6only sysctl per namespace
Daniel,

The kernel fails to build with this patch applied when CONFIG_SYSCTL=n.
See comment below.

Daniel Lezcano wrote:
> This patch moves the bindv6only sysctl to the network namespace
> structure. Until the ipv6 protocol is not per namespace, the sysctl
> variable is always from the initial network namespace.
>
> Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
> ---
>  include/net/ipv6.h         |    1 -
>  include/net/netns/ipv6.h   |    1 +
>  net/ipv6/af_inet6.c        |    4 +---
>  net/ipv6/sysctl_net_ipv6.c |    6 +++++-
>  4 files changed, 7 insertions(+), 5 deletions(-)
>
> Index: net-2.6.25/include/net/ipv6.h
> ===================================================================
> --- net-2.6.25.orig/include/net/ipv6.h
> +++ net-2.6.25/include/net/ipv6.h
> @@ -109,7 +109,6 @@ struct frag_hdr {
>  #include <net/sock.h>
>
>  /* sysctls */
> -extern int sysctl_ipv6_bindv6only;
>  extern int sysctl_mld_max_msf;
>
>  #define _DEVINC(statname, modifier, idev, field)	\
> Index: net-2.6.25/include/net/netns/ipv6.h
> ===================================================================
> --- net-2.6.25.orig/include/net/netns/ipv6.h
> +++ net-2.6.25/include/net/netns/ipv6.h
> @@ -9,6 +9,7 @@ struct ctl_table_header;
>
>  struct netns_sysctl_ipv6 {
>  	struct ctl_table_header *table;
> +	int bindv6only;
>  };
>
>  struct netns_ipv6 {
> Index: net-2.6.25/net/ipv6/af_inet6.c
> ===================================================================
> --- net-2.6.25.orig/net/ipv6/af_inet6.c
> +++ net-2.6.25/net/ipv6/af_inet6.c
> @@ -66,8 +66,6 @@ MODULE_AUTHOR("Cast of dozens");
>  MODULE_DESCRIPTION("IPv6 protocol stack for Linux");
>  MODULE_LICENSE("GPL");
>
> -int sysctl_ipv6_bindv6only __read_mostly;
> -
>  /* The inetsw6 table contains everything that inet6_create needs to
>   * build a new socket.
>   */
> @@ -193,7 +191,7 @@ lookup_protocol:
>  	np->mcast_hops	= -1;
>  	np->mc_loop	= 1;
>  	np->pmtudisc	= IPV6_PMTUDISC_WANT;
> -	np->ipv6only	= sysctl_ipv6_bindv6only;
> +	np->ipv6only	= init_net.ipv6.sysctl.bindv6only;

The problem is here: init_net.ipv6.sysctl is not defined if
CONFIG_SYSCTL=n.

Benjamin

>
>  	/* Init the ipv4 part of the socket since we can have sockets
>  	 * using v6 API for ipv4.
> Index: net-2.6.25/net/ipv6/sysctl_net_ipv6.c
> ===================================================================
> --- net-2.6.25.orig/net/ipv6/sysctl_net_ipv6.c
> +++ net-2.6.25/net/ipv6/sysctl_net_ipv6.c
> @@ -35,7 +35,7 @@ static ctl_table ipv6_table_template[] =
>  	{
>  		.ctl_name	= NET_IPV6_BINDV6ONLY,
>  		.procname	= "bindv6only",
> -		.data		= &sysctl_ipv6_bindv6only,
> +		.data		= &init_net.ipv6.sysctl.bindv6only,
>  		.maxlen		= sizeof(int),
>  		.mode		= 0644,
>  		.proc_handler	= &proc_dointvec
> @@ -115,6 +115,10 @@ static int ipv6_sysctl_net_init(struct n
>  	ipv6_table[0].child = ipv6_route_table;
>  	ipv6_table[1].child = ipv6_icmp_table;
>
> +	ipv6_table[2].data = &net->ipv6.sysctl.bindv6only;
> +
> +	net->ipv6.sysctl.bindv6only = 0;
> +
>  	net->ipv6.sysctl.table = register_net_sysctl_table(net, ipv6_ctl_path, ipv6_table);
>  	if (!net->ipv6.sysctl.table)
>  		goto out_ipv6_icmp_table;
>
> --

--
B e n j a m i n   T h e r y  - BULL/DT/Open Software R&D

   http://www.bull.com
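The build failure reported above is the usual hazard of reading a struct member from unconditional code when that member only exists under a config option. The sketch below illustrates one possible layout that avoids it, under the assumption (taken from the comment in the mail, not checked against the tree) that the `sysctl` member disappears when CONFIG_SYSCTL=n. Every `*_demo` name and the `CONFIG_SYSCTL_DEMO` macro are hypothetical; this is not presented as the change that was eventually merged.

```c
/* Compiles both with and without -DCONFIG_SYSCTL_DEMO. */

struct ctl_table_header_demo;   /* opaque; only used when sysctl support is built */

struct netns_sysctl_ipv6_demo {
#ifdef CONFIG_SYSCTL_DEMO
	struct ctl_table_header_demo *table;   /* sysctl registration handle */
#endif
	int bindv6only;                         /* value itself is always present */
};

struct netns_ipv6_demo {
	struct netns_sysctl_ipv6_demo sysctl;   /* member kept unconditional */
};

struct net_demo {
	struct netns_ipv6_demo ipv6;
};

/*
 * Models the socket-creation path: it only needs the value, so it
 * builds regardless of whether the sysctl interface is configured.
 */
static int demo_default_ipv6only(const struct net_demo *net)
{
	return net->ipv6.sysctl.bindv6only;
}
```

The design point is simply that only the registration handle depends on the config option, while the value read at socket creation stays reachable in every configuration.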
Re: 2.6.24-rc5-mm1
The problem comes from the new macro UDPX_INC_STATS_BH introduced by
Herbert. It was a nice addition to increment the correct UDP MIB
depending on the socket family, but unfortunately using this macro from
built-in kernel code (I mean code not compiled as a module) requires
that IPv6 is also compiled into the kernel (CONFIG_IPV6=y) in order to
have udp_stats_in6 defined at link time.

Benjamin

Pierre Peiffer wrote:
> Hi,
>
> My config does not link any more:
>
> ...
>   CHK     include/linux/compile.h
>   UPD     include/linux/compile.h
>   CC      init/version.o
>   LD      init/built-in.o
>   LD      .tmp_vmlinux1
> net/built-in.o: In function `xs_udp_data_ready':
> /home/peifferp/containers/kernel/linux-2.6.24-rc5-mm1/net/sunrpc/xprtsock.c:842: undefined reference to `udp_stats_in6'
> /home/peifferp/containers/kernel/linux-2.6.24-rc5-mm1/net/sunrpc/xprtsock.c:846: undefined reference to `udp_stats_in6'
> make[1]: *** [.tmp_vmlinux1] Error 1
> make: *** [sub-make] Error 2
>
> After a first look, udp_stats_in6 seems to be defined in ipv6 (file
> net/ipv6/udp.c) but I have
> CONFIG_IPV6=m
> and
> CONFIG_SUNRPC=y
>
> So, SUNRPC uses something defined in a module in my case ?
>
> ... looking more, this dependency seems to have been introduced by the
> patch "[UDP]: Restore missing inDatagrams increments"
> ( http://thread.gmane.org/gmane.linux.network/79716/focus=79831 )
> (I cc netdev)
>
> I don't know what is the right way to fix this ... ?
>
> P.
>
> Andrew Morton wrote:
>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc5/2.6.24-rc5-mm1/
>>
>> - If something goes wrong with a PCI device's probing or
>>   initialisation, try reverting
>>   pci-disable-decoding-during-sizing-of-bars.patch.
>>
>> - git-sched was dropped due to breaking suspend-to-RAM.
>>
>> - git-block has been restored after having had a few problems
>>
>> - git-newsetup.patch was dropped due to conflicts with git-x86
>>
>> - git-perfmon.patch is still dropped for the same reason
>>
>> - git-kgdb.patch is still dropped for the same reason
>>
>> - Please do try to cc the correct developer and mailing list when
>>   reporting problems - I'm just buried in bugs over here.
>>
>> Boilerplate:
>>
>> - See the `hot-fixes' directory for any important updates to this
>>   patchset.
>>
>> - To fetch an -mm tree using git, use (for example)
>>
>>   git-fetch git://git.kernel.org/pub/scm/linux/kernel/git/smurf/linux-trees.git tag v2.6.16-rc2-mm1
>>   git-checkout -b local-v2.6.16-rc2-mm1 v2.6.16-rc2-mm1
>>
>> - -mm kernel commit activity can be reviewed by subscribing to the
>>   mm-commits mailing list.
>>
>>   echo subscribe mm-commits | mail [EMAIL PROTECTED]
>>
>> - If you hit a bug in -mm and it is not obvious which patch caused it,
>>   it is most valuable if you can perform a bisection search to
>>   identify which patch introduced the bug. Instructions for this
>>   process are at
>>
>>   http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt
>>
>>   But beware that this process takes some time (around ten rebuilds
>>   and reboots), so consider reporting the bug first and if we cannot
>>   immediately identify the faulty patch, then perform the bisection
>>   search.
>>
>> - When reporting bugs, please try to Cc: the relevant maintainer and
>>   mailing list on any email.
>>
>> - When reporting bugs in this kernel via email, please also rewrite
>>   the email Subject: in some manner to reflect the nature of the bug.
>>   Some developers filter by Subject: when looking for messages to
>>   read.
>>
>> - Occasional snapshots of the -mm lineup are uploaded to
>>   ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/ and are
>>   announced on the mm-commits list. These probably are at least
>>   compilable.
>>
>> - More-than-daily -mm snapshots may be found at
>>   http://userweb.kernel.org/~akpm/mmotm/. These are almost certainly
>>   not compileable.
Changes since 2.6.24-rc4-mm1: origin.patch git-acpi.patch git-alsa.patch git-agpgart.patch git-arm.patch git-arm-master.patch git-avr32.patch git-cpufreq.patch git-powerpc.patch git-drm.patch git-dvb.patch git-hwmon.patch git-gfs2-nmw.patch git-hid.patch git-hrt.patch git-ieee1394.patch git-infiniband.patch git-input.patch git-jfs.patch git-kbuild.patch git-kvm.patch git-lblnet.patch git-leds.patch git-libata-all.patch git-md-accel.patch git-mips.patch git-mmc.patch git-mtd.patch git-ubi.patch git-net.patch git-netdev-all.patch git-battery.patch git-nfs.patch git-nfsd.patch git-ocfs2.patch git-s390.patch git-sh.patch git-scsi-misc.patch git-scsi-rc-fixes.patch git-block.patch git-unionfs.patch git-v9fs.patch git-watchdog.patch git-wireless.patch git-ipwireless_cs.patch git-x86.patch git-xfs.patch git-cryptodev.patch git-xtensa.patch git trees -aio-only-account-i-o-wait-time-in-read_events-if-there-are-active-requests.patch -fix-cloneclone_newpid.patch -rtc-assure-proper-memory-ordering-with-respect-to-rtc_dev_busy-flag.patch -ufs-fix-nexstep-dir-block-size.patch
Re: [PATCH 1/2] Convert /proc/net/ipv6_route to seq_file interface
Alexey Dobriyan wrote: One proc_net_create() user less. Funny, I was working on a similar patch. See comment below. Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED] --- net/ipv6/route.c | 70 +++ 1 file changed, 25 insertions(+), 45 deletions(-) --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -2288,71 +2288,49 @@ struct rt6_proc_arg static int rt6_info_route(struct rt6_info *rt, void *p_arg) { - struct rt6_proc_arg *arg = (struct rt6_proc_arg *) p_arg; + struct seq_file *m = p_arg; - if (arg-skip arg-offset / RT6_INFO_LEN) { - arg-skip++; - return 0; - } - - if (arg-len = arg-length) - return 0; - - arg-len += sprintf(arg-buffer + arg-len, - NIP6_SEQFMT %02x , - NIP6(rt-rt6i_dst.addr), + seq_printf(m, NIP6_SEQFMT %02x , NIP6(rt-rt6i_dst.addr), rt-rt6i_dst.plen); #ifdef CONFIG_IPV6_SUBTREES - arg-len += sprintf(arg-buffer + arg-len, - NIP6_SEQFMT %02x , - NIP6(rt-rt6i_src.addr), + seq_printf(m, NIP6_SEQFMT %02x , NIP6(rt-rt6i_src.addr), rt-rt6i_src.plen); #else - arg-len += sprintf(arg-buffer + arg-len, - 00 ); + seq_puts(m, 00 ); #endif if (rt-rt6i_nexthop) { - arg-len += sprintf(arg-buffer + arg-len, - NIP6_SEQFMT, + seq_printf(m, NIP6_SEQFMT, NIP6(*((struct in6_addr *)rt-rt6i_nexthop-primary_key))); } else { - arg-len += sprintf(arg-buffer + arg-len, - ); + seq_puts(m, ); } - arg-len += sprintf(arg-buffer + arg-len, - %08x %08x %08x %08x %8s\n, + seq_printf(m, %08x %08x %08x %08x %8s\n, rt-rt6i_metric, atomic_read(rt-u.dst.__refcnt), rt-u.dst.__use, rt-rt6i_flags, rt-rt6i_dev ? 
rt-rt6i_dev-name : ); return 0; } -static int rt6_proc_info(char *buffer, char **start, off_t offset, int length) +static int ipv6_route_show(struct seq_file *m, void *v) { - struct rt6_proc_arg arg = { - .buffer = buffer, - .offset = offset, - .length = length, - }; - - fib6_clean_all(rt6_info_route, 0, arg); - - *start = buffer; - if (offset) - *start += offset % RT6_INFO_LEN; - - arg.len -= offset % RT6_INFO_LEN; - - if (arg.len length) - arg.len = length; - if (arg.len 0) - arg.len = 0; + fib6_clean_all(rt6_info_route, 0, m); + return 0; +} - return arg.len; +static int ipv6_route_open(struct inode *inode, struct file *file) +{ + return single_open(file, ipv6_route_show, NULL); } +static const struct file_operations ipv6_route_proc_fops = { + .open = ipv6_route_open, + .read = seq_read, + .llseek = seq_lseek, + .release= single_release, +}; + static int rt6_stats_seq_show(struct seq_file *seq, void *v) { seq_printf(seq, %04x %04x %04x %04x %04x %04x %04x\n, @@ -2499,9 +2477,11 @@ void __init ip6_route_init(void) fib6_init(); #ifdef CONFIG_PROC_FS - p = proc_net_create(init_net, ipv6_route, 0, rt6_proc_info); - if (p) + p = create_proc_entry(ipv6_route, 0, init_net.proc_net); + if (p) { p-owner = THIS_MODULE; + p-proc_fops = ipv6_route_proc_fops; + } You should use proc_net_fops_create() instead of the above code. It does the same thing. Otherwise the patch looks fine to me. Tested on i386. Benjamin proc_net_fops_create(init_net, rt6_stats, S_IRUGO, rt6_stats_seq_fops); #endif - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- B e n j a m i n T h e r y - BULL/DT/Open Software RD http://www.bull.com - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] Convert /proc/net/ipv6_route to seq_file interface
Cosmetic comment: I forgot to say there are a few indentation errors when I apply your patch. See below. Benjamin Thery wrote: Alexey Dobriyan wrote: One proc_net_create() user less. Funny, I was working on a similar patch. See comment below. Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED] --- net/ipv6/route.c | 70 +++ 1 file changed, 25 insertions(+), 45 deletions(-) --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -2288,71 +2288,49 @@ struct rt6_proc_arg static int rt6_info_route(struct rt6_info *rt, void *p_arg) { -struct rt6_proc_arg *arg = (struct rt6_proc_arg *) p_arg; +struct seq_file *m = p_arg; -if (arg-skip arg-offset / RT6_INFO_LEN) { -arg-skip++; -return 0; -} - -if (arg-len = arg-length) -return 0; - -arg-len += sprintf(arg-buffer + arg-len, -NIP6_SEQFMT %02x , -NIP6(rt-rt6i_dst.addr), +seq_printf(m, NIP6_SEQFMT %02x , NIP6(rt-rt6i_dst.addr), rt-rt6i_dst.plen); #ifdef CONFIG_IPV6_SUBTREES -arg-len += sprintf(arg-buffer + arg-len, -NIP6_SEQFMT %02x , -NIP6(rt-rt6i_src.addr), +seq_printf(m, NIP6_SEQFMT %02x , NIP6(rt-rt6i_src.addr), rt-rt6i_src.plen); Indent is wrong for the above line. #else -arg-len += sprintf(arg-buffer + arg-len, - 00 ); +seq_puts(m, 00 ); #endif if (rt-rt6i_nexthop) { -arg-len += sprintf(arg-buffer + arg-len, -NIP6_SEQFMT, +seq_printf(m, NIP6_SEQFMT, NIP6(*((struct in6_addr *)rt-rt6i_nexthop-primary_key))); Idem. } else { -arg-len += sprintf(arg-buffer + arg-len, -); +seq_puts(m, ); } -arg-len += sprintf(arg-buffer + arg-len, - %08x %08x %08x %08x %8s\n, +seq_printf(m, %08x %08x %08x %08x %8s\n, rt-rt6i_metric, atomic_read(rt-u.dst.__refcnt), rt-u.dst.__use, rt-rt6i_flags, rt-rt6i_dev ? rt-rt6i_dev-name : ); Indent of the 3 above lines. 
return 0; } -static int rt6_proc_info(char *buffer, char **start, off_t offset, int length) +static int ipv6_route_show(struct seq_file *m, void *v) { -struct rt6_proc_arg arg = { -.buffer = buffer, -.offset = offset, -.length = length, -}; - -fib6_clean_all(rt6_info_route, 0, arg); - -*start = buffer; -if (offset) -*start += offset % RT6_INFO_LEN; - -arg.len -= offset % RT6_INFO_LEN; - -if (arg.len length) -arg.len = length; -if (arg.len 0) -arg.len = 0; +fib6_clean_all(rt6_info_route, 0, m); +return 0; +} -return arg.len; +static int ipv6_route_open(struct inode *inode, struct file *file) +{ +return single_open(file, ipv6_route_show, NULL); } +static const struct file_operations ipv6_route_proc_fops = { +.open = ipv6_route_open, +.read = seq_read, +.llseek = seq_lseek, +.release= single_release, +}; + static int rt6_stats_seq_show(struct seq_file *seq, void *v) { seq_printf(seq, %04x %04x %04x %04x %04x %04x %04x\n, @@ -2499,9 +2477,11 @@ void __init ip6_route_init(void) fib6_init(); #ifdef CONFIG_PROC_FS -p = proc_net_create(init_net, ipv6_route, 0, rt6_proc_info); -if (p) +p = create_proc_entry(ipv6_route, 0, init_net.proc_net); +if (p) { p-owner = THIS_MODULE; +p-proc_fops = ipv6_route_proc_fops; +} You should use proc_net_fops_create() instead of the above code. It does the same thing. Otherwise the patch looks fine to me. Tested on i386. Benjamin proc_net_fops_create(init_net, rt6_stats, S_IRUGO, rt6_stats_seq_fops); #endif - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- B e n j a m i n T h e r y - BULL/DT/Open Software RD http://www.bull.com - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NETNS] Oops in register_pernet_operations() with CONFIG_NET_NS=n
David Miller wrote:
> From: [EMAIL PROTECTED] (Eric W. Biederman)
> Date: Thu, 25 Oct 2007 11:21:55 -0600
>
>>> By the way, I think that we can in the case of undefined CONFIG_NET_NS
>>> reduce register to calling ->init method and unregister to calling
>>> ->exit method. This is a correct thing at least for now and will be
>>> welcomed by all the embedded/etc people.
>>
>> I'm not fundamentally opposed. Earlier versions of my patchset did
>> that and more. However I think the pain is greater than the gain right
>> now. Especially since this concept seems to require having quality
>> inspected into it.
>
> I think the correct thing to do for now is to simply remove these
> __net_* markers and their definitions. There are so many tricky
> cases that it is easier to just get rid of them.
>
> Could someone send me a patch which does that?

The attached patch reverts Pavel's original patch from 2.6.23-mm1.
It should work fine with net-2.6 too.

Benjamin

--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com

This patch reverts the patch sent by Pavel Emelyanov that introduced
__net_init/__net_exit/__net_initdata defines to save some memory when
CONFIG_NET_NS=n.

    http://www.spinics.net/lists/netdev/msg43310.html

When CONFIG_NET_NS=n, that patch causes an oops when a netns-aware
module is loaded after boot.

When it is initialized, the module tries to register its pernet
operations and add them to the pernet_list. Unfortunately, this list is
corrupted because its first entries were freed at the end of the boot
sequence.

Signed-off-by: Benjamin Thery [EMAIL PROTECTED] --- drivers/net/loopback.c |6 +++--- fs/proc/proc_net.c |8 include/linux/init.h|1 - include/net/net_namespace.h |9 - net/core/dev.c | 16 net/core/dev_mcast.c|6 +++--- net/netlink/af_netlink.c|6 +++--- scripts/mod/modpost.c |1 - 8 files changed, 21 insertions(+), 32 deletions(-) Index: linux-2.6.23-mm1-lxc1/drivers/net/loopback.c === --- linux-2.6.23-mm1-lxc1.orig/drivers/net/loopback.c +++ linux-2.6.23-mm1-lxc1/drivers/net/loopback.c @@ -250,7 +250,7 @@ static void loopback_setup(struct net_de } /* Setup and register the loopback device. */ -static __net_init int loopback_net_init(struct net *net) +static int loopback_net_init(struct net *net) { struct net_device *dev; int err; @@ -278,14 +278,14 @@ out_free_netdev: goto out; } -static __net_exit void loopback_net_exit(struct net *net) +static void loopback_net_exit(struct net *net) { struct net_device *dev = net-loopback_dev; unregister_netdev(dev); } -static struct pernet_operations __net_initdata loopback_net_ops = { +static struct pernet_operations loopback_net_ops = { .init = loopback_net_init, .exit = loopback_net_exit, }; Index: linux-2.6.23-mm1-lxc1/fs/proc/proc_net.c === --- linux-2.6.23-mm1-lxc1.orig/fs/proc/proc_net.c +++ linux-2.6.23-mm1-lxc1/fs/proc/proc_net.c @@ -140,7 +140,7 @@ static struct inode_operations proc_net_ .setattr = proc_net_setattr, }; -static __net_init int proc_net_ns_init(struct net *net) +static int proc_net_ns_init(struct net *net) { struct proc_dir_entry *root, *netd, *net_statd; int err; @@ -178,19 +178,19 @@ free_root: goto out; } -static __net_exit void proc_net_ns_exit(struct net *net) +static void proc_net_ns_exit(struct net *net) { remove_proc_entry(stat, net-proc_net); remove_proc_entry(net, net-proc_net_root); kfree(net-proc_net_root); } -struct pernet_operations __net_initdata proc_net_ns_ops = { +struct pernet_operations proc_net_ns_ops = { .init = proc_net_ns_init, .exit = proc_net_ns_exit, }; -int __init 
proc_net_init(void) +int proc_net_init(void) { proc_net_shadow = proc_mkdir(net, NULL); proc_net_shadow-proc_iops = proc_net_dir_inode_operations; Index: linux-2.6.23-mm1-lxc1/include/linux/init.h === --- linux-2.6.23-mm1-lxc1.orig/include/linux/init.h +++ linux-2.6.23-mm1-lxc1/include/linux/init.h @@ -57,7 +57,6 @@ * The markers follow same syntax rules as __init / __initdata. */ #define __init_refok noinline __attribute__ ((__section__ (.text.init.refok))) #define __initdata_refok __attribute__ ((__section__ (.data.init.refok))) -#define __exit_refok noinline __attribute__ ((__section__ (.exit.text.refok))) #ifdef MODULE #define __exit __attribute__ ((__section__(.exit.text))) __cold Index: linux-2.6.23-mm1-lxc1/include/net/net_namespace.h === --- linux-2.6.23-mm1-lxc1.orig/include/net/net_namespace.h +++ linux-2.6.23-mm1-lxc1/include/net/net_namespace.h @@ -99,15 +99,6 @@ static inline void release_net(struct ne #define for_each_net(VAR)\ list_for_each_entry
[NETNS] Oops in register_pernet_operations() with CONFIG_NET_NS=n
Hello Pavel,

I've found a problem with one of your patches related to netns:

  * [NETNS] Move some code into __init section when CONFIG_NET_NS=n (v2)
    http://www.spinics.net/lists/netdev/msg43310.html

This patch introduces the __net_init/__net_exit/__net_initdata defines
to save some memory when CONFIG_NET_NS is not set.

Cedric Le Goater reported he had a *non-fatal* oops when booting a
2.6.23-mm1-lxc1 kernel with CONFIG_NET_NS=n (2.6.23-mm1-lxc1 contains
the NETNS49 patchset). The oops occurred when modules related to
iptables were loaded after the boot completed.

The problem is the following:

- Your patch adds the __net_initdata attribute to pernet_operations
  structures.

- pernet_operations are registered via register_pernet_subsys() and
  linked into the pernet_list during boot.

- At the end of boot, pernet_operations are freed (because of the
  __net_initdata attribute), and the pernet_list (or first_device list)
  points to freed memory.

- After boot, network modules which are netns-aware try to register
  themselves with register_pernet_subsys() and ...KABOOM... page fault
  when accessing pernet_list (or first_device list).

(I reproduced Cedric's oops with the command: iptables --list)

This is not a problem right now in 2.6.23-mm1 or net-2.6 because
there are very few netns-aware network subsystems merged and they are
all initialized during boot. But it will become a problem when we merge
netns code for subsystems which can be built as modules
(e.g. iptables, ...).

I'm not sure we can use __net_initdata for pernet_operations then.
Maybe we can add some checks in register_pernet_operations when
CONFIG_NET_NS=n. I haven't found a fix yet.

Regards,
Benjamin

--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com
Re: [NETNS] Oops in register_pernet_operations() with CONFIG_NET_NS=n
Denis V. Lunev wrote: The patch attached should help. The idea is simple. The init should be called only once without NETNS. Period. No need for any lists. This is the kind of idea I had but I didn't think it could be that simple. :) Thanks Denis. I'll resend it to Dave after the ACK. Tested on x86_64 with CONFIG_NET_NS=n and y. It fixes the issue we observed. Acked-by: Benjamin Thery [EMAIL PROTECTED] Regards, Den Benjamin Thery wrote: Hello Pavel, I've found a problem with one of your patch related to netns: * [NETNS] Move some code into __init section when CONFIG_NET_NS=n (v2) http://www.spinics.net/lists/netdev/msg43310.html This patch introduces the __net_init/__net_exit/__net_initdata defines to save some memory when CONFIG_NET_NS is not set. Cedric Le Goater reported he had a *non-fatal* oops when booting a 2.6.23-mm1-lxc1 kernel with CONFIG_NET_NS=n. (2.6.23-mm1-lxc1 contains the NETNS49 patchset). The oops occured when modules related to iptables were loaded after the boot completes. The problem is the following: - Your patch adds the __net_initdata attribute to pernet_operations structures. - pernet_operations are registered via register_pernet_subsys() and linked in the pernet_list during boot. - At the end of boot, pernet_operations are freed (because of the __net_initdata attribute), and the pernet_list (or first_device list) points to freed memory. - After boot, network modules which are netns-aware try to register themselves with register_pernet_subsys() and ...KABOOM... page fault when accessing pernet_list (or first_device list). (I reproduce Cedric's oops with the command: iptables --list) This is not a problem right now in 2.6.23-mm1 or net-2.6 because there are very few netns-aware network subsystems merged and they are all initialized during boot. But it will be problematic when we will merge netns code for subsystems which can be built as modules (eg. iptables, ...). I'm not sure we can use __net_init_data for pernet_operations then. 
Maybe we can add some checks in register_pernet_operations when CONFIG_NET_NS=n. I haven't found a fix yet. Regards, Benjamin -- B e n j a m i n T h e r y - BULL/DT/Open Software RD http://www.bull.com - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
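Denis's patch itself is not reproduced in this thread, so what follows is only a minimal userspace sketch of the idea as he describes it: without network namespaces there is exactly one struct net, so registering can simply call the init method once and unregistering can call exit, with no list to maintain. The struct layouts, the `_nons` function names and the `demo_*` helpers are assumptions made for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the shape matters. */
struct net { int users; };

struct pernet_operations {
	int  (*init)(struct net *net);
	void (*exit)(struct net *net);
};

/* Without namespaces there is a single, statically allocated instance. */
static struct net init_net;

/* Illustrative name: call init once for init_net and keep no list. */
static int register_pernet_operations_nons(struct pernet_operations *ops)
{
	if (ops->init == NULL)
		return 0;
	return ops->init(&init_net);
}

/* Illustrative name: the mirror image, call exit once. */
static void unregister_pernet_operations_nons(struct pernet_operations *ops)
{
	if (ops->exit != NULL)
		ops->exit(&init_net);
}

/* Demo subsystem that only counts how often it has been set up. */
static int demo_init(struct net *net) { net->users++; return 0; }
static void demo_exit(struct net *net) { net->users--; }

/* Returns the user count observed between register and unregister. */
int demo_register_cycle(void)
{
	struct pernet_operations ops = { .init = demo_init, .exit = demo_exit };
	int seen;

	if (register_pernet_operations_nons(&ops) != 0)
		return -1;
	seen = init_net.users;
	unregister_pernet_operations_nons(&ops);
	assert(init_net.users == seen - 1);
	return seen;
}
```

In this sketch nothing keeps a pointer to `ops` once the call returns, so it no longer matters where the structure lives or when its memory is released. Whether the real case is this simple (module unload in particular) is a separate question.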
Re: [NETNS] Oops in register_pernet_operations() with CONFIG_NET_NS=n
Eric W. Biederman wrote:
> Benjamin Thery [EMAIL PROTECTED] writes:
>
>> Denis V. Lunev wrote:
>>> The patch attached should help. The idea is simple. The init should be
>>> called only once without NETNS. Period. No need for any lists.
>>
>> This is the kind of idea I had but I didn't think it could be
>> that simple. :) Thanks Denis.
>
> It isn't.
>
>>> I'll resend it to Dave after the ACK.
>>
>> Tested on x86_64 with CONFIG_NET_NS=n and y.
>> It fixes the issue we observed.
>>
>> Acked-by: Benjamin Thery [EMAIL PROTECTED]
>
> Try rmmod.

rmmod was part of my tests and it does work.

I did:

$ iptables --list
  modules x_tables, ip_tables, iptable_filter are loaded, each calling
  register_pernet_subsys.
$ rmmod iptable_filter ip_tables x_tables
  No problem here
$ iptables --list
  To be sure I can load the modules again.

> Eric

--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com
[PATCH] [IPv6]: use container_of() macro in fib6_clean_node()
In ip6_fib.c, fib6_clean_node() casts a fib6_walker_t pointer to a
fib6_cleaner_t pointer, assuming that the struct fib6_walker_t member
(field 'w') is the first field in struct fib6_cleaner_t.

To prevent any future problems that may occur if one day a field
is inadvertently inserted before the 'w' field in struct fib6_cleaner_t
(and to improve readability), this patch uses the container_of() macro.

Patch for net-2.6.24

Signed-off-by: Benjamin Thery [EMAIL PROTECTED]
---
 net/ipv6/ip6_fib.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 6a612a7..946cf38 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1313,7 +1313,7 @@ static int fib6_clean_node(struct fib6_walker_t *w)
 {
 	int res;
 	struct rt6_info *rt;
-	struct fib6_cleaner_t *c = (struct fib6_cleaner_t*)w;
+	struct fib6_cleaner_t *c = container_of(w, struct fib6_cleaner_t, w);
 
 	for (rt = w->leaf; rt; rt = rt->u.dst.rt6_next) {
 		res = c->func(rt, c->arg);
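To illustrate why container_of() is more robust than a plain cast, here is a small userspace sketch. The macro is a simplified version built on offsetof() (the kernel's macro adds a type check), and `struct walker`, `struct cleaner` and the `tag` field are made-up stand-ins, not the ip6_fib.c definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace variant; the kernel macro also type-checks ptr. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for fib6_walker_t and fib6_cleaner_t. */
struct walker {
	int state;
};

struct cleaner {
	int tag;		/* a field placed before the embedded walker */
	struct walker w;
	int arg;
};

/* Recover the enclosing cleaner from a pointer to its embedded walker. */
static int cleaner_arg_from_walker(struct walker *w)
{
	struct cleaner *c = container_of(w, struct cleaner, w);

	return c->arg;
}

/* Returns the arg value read back through the embedded walker pointer. */
int demo_container_of(void)
{
	struct cleaner c = { .tag = 1, .w = { .state = 0 }, .arg = 42 };

	/* 'w' is deliberately not the first member in this sketch. */
	assert(offsetof(struct cleaner, w) != 0);
	return cleaner_arg_from_walker(&c.w);
}
```

Because `w` is not the first member here, casting `&c.w` straight to `struct cleaner *` would point into the middle of the object; subtracting the member offset gives the correct address wherever the member sits.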
[PATCH 1/1][NET] : Fix dev_put() and dev_hold() comments
Trivial fix: Swap comments for dev_put() and dev_hold() to get them at
the right place.

Typo introduced by 4fa57c9ea9f36f9ca852f3a88ca5d2f1aebbc960.

Signed-off-by: Benjamin Thery [EMAIL PROTECTED]
---
 include/linux/netdevice.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: net-2.6.24/include/linux/netdevice.h
===
--- net-2.6.24.orig/include/linux/netdevice.h
+++ net-2.6.24/include/linux/netdevice.h
@@ -1054,7 +1054,7 @@ extern void netdev_run_todo(void);
  *	dev_put - release reference to device
  *	@dev: network device
  *
- * Hold reference to device to keep it from being freed.
+ * Release reference to device to allow it to be freed.
  */
 static inline void dev_put(struct net_device *dev)
 {
@@ -1065,7 +1065,7 @@ static inline void dev_put(struct net_de
  *	dev_hold - get reference to device
  *	@dev: network device
  *
- * Release reference to device to allow it to be freed.
+ * Hold reference to device to keep it from being freed.
  */
 static inline void dev_hold(struct net_device *dev)
 {
Re: [PATCH] Rename struct net to struct netns
Daniel Lezcano wrote:
> Pavel Emelyanov wrote:
>> The name struct net is too generic. There already were some
>> people who wanted to have some better name (for easier grep
>> for example). I propose the struct netns one.
>>
>> The patch is (already) huge (sorry), but it's nothing but
>> sed -e s/struct net\/struct netns/g
>>
>> If this name is bad as well, let's select a new one before
>> the struct net floods the kernel.
>
> [ SNIP ]
>
>> --- a/include/linux/nsproxy.h
>> +++ b/include/linux/nsproxy.h
>> @@ -29,7 +29,7 @@ struct nsproxy {
>>  	struct mnt_namespace *mnt_ns;
>>  	struct pid_namespace *pid_ns;
>>  	struct user_namespace *user_ns;
>> -	struct net *net_ns;
>> +	struct netns *net_ns;
>
> IMHO, if we want to be consistent with all the rest of the namespaces,
> that should be net_namespace.

Sure, it's a good argument.

But I find that 'net', although it is an uber-generic name, represents
its contents appropriately: its function is to store all the data for a
network stack, so it is what represents a network in the kernel.

Anyway, if we want to change it, I think net_namespace is better than
netns because of the consistency argument given by Daniel.
(But it's longer :( )

Just my 2 cents.

Benjamin

> [ SNIP ]

--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com
Re: [PATCH 1/1] net/core: Fix crash in dev_mc_sync()/dev_mc_unsync()
Oops, don't use the previous version of the patch: the change in dev_mc_unsync() was not correct. Sorry. This one is a lot better (it compiles and runs). :) Benjamin -- B e n j a m i n T h e r y - BULL/DT/Open Software RD http://www.bull.com From: [EMAIL PROTECTED] Subject: net/core: Fix crash in dev_mc_sync()/dev_mc_unsync() This patch fixes a crash that may occur when the routine dev_mc_sync() deletes an address from the list it is currently going through. It saves the pointer to the next element before deleting the current one. The problem may also exist in dev_mc_unsync(). Signed-off-by: Benjamin Thery [EMAIL PROTECTED] --- net/core/dev_mcast.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) Index: linux-2.6.23-rc2/net/core/dev_mcast.c === --- linux-2.6.23-rc2.orig/net/core/dev_mcast.c +++ linux-2.6.23-rc2/net/core/dev_mcast.c @@ -116,11 +116,13 @@ int dev_mc_add(struct net_device *dev, v */ int dev_mc_sync(struct net_device *to, struct net_device *from) { - struct dev_addr_list *da; + struct dev_addr_list *da, *next; int err = 0; netif_tx_lock_bh(to); - for (da = from-mc_list; da != NULL; da = da-next) { + da = from-mc_list; + while (da != NULL) { + next = da-next; if (!da-da_synced) { err = __dev_addr_add(to-mc_list, to-mc_count, da-da_addr, da-da_addrlen, 0); @@ -134,6 +136,7 @@ int dev_mc_sync(struct net_device *to, s __dev_addr_delete(from-mc_list, from-mc_count, da-da_addr, da-da_addrlen, 0); } + da = next; } if (!err) __dev_set_rx_mode(to); @@ -156,12 +159,14 @@ EXPORT_SYMBOL(dev_mc_sync); */ void dev_mc_unsync(struct net_device *to, struct net_device *from) { - struct dev_addr_list *da; + struct dev_addr_list *da, *next; netif_tx_lock_bh(from); netif_tx_lock_bh(to); - for (da = from-mc_list; da != NULL; da = da-next) { + da = from-mc_list; + while (da != NULL) { + next = da-next; if (!da-da_synced) continue; __dev_addr_delete(to-mc_list, to-mc_count, @@ -169,6 +174,7 @@ void dev_mc_unsync(struct net_device *to da-da_synced = 0; 
__dev_addr_delete(from-mc_list, from-mc_count, da-da_addr, da-da_addrlen, 0); + da = next; } __dev_set_rx_mode(to);
[PATCH 0/1] net/core: Crash in dev_mc_sync() when putting macvlan interface up
Hi,

My kernel crashed while testing macvlan interfaces on 2.6.23-rc2.
(See kernel panic below.)

The culprit is dev_mc_sync(). In this routine, we delete elements from
'from->mc_list' unsafely. While going through the list, we may delete
one of the elements (__dev_addr_delete(&from->mc_list,...)), and then
try to continue from that same element, which has just been freed:
for(..., da = da->next).

It took me some time to understand why only one of my test machines was
crashing. After a while I discovered that my crashing victim has
CONFIG_DEBUG_SLAB=y set, which poisons the freed 'struct dev_addr_list'.
(Now I love poison!)

The crash can be reproduced by setting the option CONFIG_DEBUG_SLAB=y.
Then, add a macvlan interface and set it up.

$ ip link add link eth0 type macvlan
$ ip link set macvlan0 up

BUG: unable to handle kernel paging request at virtual address 6b6b6b6b
 printing eip:
c025e9b4
*pde =
Oops: [#1]
Modules linked in:
CPU:0
EIP:0060:[c025e9b4]Not tainted VLI
EFLAGS: 0282 (2.6.23-rc2-eb-netns #6)
EIP is at dev_mc_sync+0x5f/0x197
eax: 0025 ebx: c11e5dec ecx: edx: 0046
esi: 6b6b6b6b edi: c1134060 ebp: c742fe6c esp: c742fe48
ds: 007b es: 007b fs: gs: 0033 ss: 0068
Process ifconfig (pid: 937, ti=c742e000 task=c1128000 task.ti=c742e000)
Stack: c034c6dc 6b6b6b6b c1134060 c7bd2180 c1134218 c7bd2180 c7bd2338 1002
       c742fe74 c02238a4 c742fe80 c025a9d8 c7bd2180 c742fe90 c025ab78
       c7bd2180 1043 c742fe9c c025ce66 c7bd2180 c742fec0 c025b034 c7bd2180
Call Trace:
 [c0102c66] show_trace_log_lvl+0x1a/0x2f
 [c0102d18] show_stack_log_lvl+0x9d/0xa5
 [c0102ede] show_registers+0x1be/0x28f
 [c0103097] die+0xe8/0x208
 [c010d555] do_page_fault+0x4ba/0x595
 [c02e3e62] error_code+0x6a/0x70
 [c02238a4] macvlan_set_multicast_list+0x15/0x17
 [c025a9d8] __dev_set_rx_mode+0x7e/0x81
 [c025ab78] dev_set_rx_mode+0x25/0x3a
 [c025ce66] dev_open+0x4b/0x6a
 [c025b034] dev_change_flags+0xa4/0x159
 [c028da20] devinet_ioctl+0x204/0x506
 [c028e082] inet_ioctl+0x86/0xa4
 [c02538f6] sock_ioctl+0x159/0x177
 [c0152ac4] do_ioctl+0x1c/0x51
 [c0152ce5] vfs_ioctl+0x1ec/0x203
 [c0152d2d] sys_ioctl+0x31/0x48
 [c01025ea] syscall_call+0x7/0xb
 ===
Code: 87 c8 01 00 00 00 00 00 00 8b b0 f8 00 00 00 c7 45 ec 00 00 00 00
      e9 0a 01 00 00 89 74 24 04 c7 04 24 dc c6 34 c0 e8 57 44 eb ff 8b
      06 c7 04 24 f9 c6 34 c0 89 44 24 04 e8 45 44 eb ff 80 7e 25
EIP: [c025e9b4] dev_mc_sync+0x5f/0x197 SS:ESP 0068:c742fe48
Kernel panic - not syncing: Fatal exception in interrupt

I think the problem may also exist in dev_mc_unsync().

I have a patch that seems to fix the issue for me.
Hope this helps.

Regards,
Benjamin
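The unsafe pattern described above, and the usual way around it, can be shown on an ordinary singly linked list in userspace. This is only a sketch: `struct addr_node`, `push()`, `drop_synced()` and the demo function are hypothetical stand-ins rather than the dev_mcast.c code. The point is the order of operations: read `da->next` before the node can be freed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct dev_addr_list. */
struct addr_node {
	int addr;
	int synced;
	struct addr_node *next;
};

/* Prepend a node; aborts on allocation failure to keep the sketch short. */
static struct addr_node *push(struct addr_node *head, int addr, int synced)
{
	struct addr_node *n = malloc(sizeof(*n));

	if (n == NULL)
		abort();
	n->addr = addr;
	n->synced = synced;
	n->next = head;
	return n;
}

/* Free every synced node; returns how many were removed. */
static int drop_synced(struct addr_node **head)
{
	struct addr_node *da = *head, *prev = NULL, *next;
	int removed = 0;

	while (da != NULL) {
		next = da->next;	/* read before da can be freed */
		if (da->synced) {
			if (prev != NULL)
				prev->next = next;
			else
				*head = next;
			free(da);
			removed++;
		} else {
			prev = da;
		}
		/* No 'continue' above: every path must reach this advance. */
		da = next;
	}
	return removed;
}

/* Builds four nodes (three synced), drops the synced ones, checks the rest. */
int demo_drop_synced(void)
{
	struct addr_node *head = NULL;
	int removed;

	head = push(head, 4, 1);
	head = push(head, 3, 1);
	head = push(head, 2, 0);
	head = push(head, 1, 1);

	removed = drop_synced(&head);
	assert(head != NULL && head->addr == 2 && head->next == NULL);
	free(head);
	return removed;
}
```

One detail to watch when turning a for loop into a while loop like this: a `continue` jumps straight back to the loop test and skips the trailing `da = next`, so the skip case is written here as an else branch instead.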
[PATCH 1/1] net/core: Fix crash in dev_mc_sync()/dev_mc_unsync()
This patch fixes a crash that may occur when the routine dev_mc_sync() deletes an address from the list it is currently going through. It saves the pointer to the next element before deleting the current one. The problem may also exist in dev_mc_unsync(). Signed-off-by: Benjamin Thery [EMAIL PROTECTED] --- net/core/dev_mcast.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) Index: linux-2.6.23-rc2/net/core/dev_mcast.c === --- linux-2.6.23-rc2.orig/net/core/dev_mcast.c +++ linux-2.6.23-rc2/net/core/dev_mcast.c @@ -116,11 +116,13 @@ int dev_mc_add(struct net_device *dev, v */ int dev_mc_sync(struct net_device *to, struct net_device *from) { - struct dev_addr_list *da; + struct dev_addr_list *da, *next; int err = 0; netif_tx_lock_bh(to); - for (da = from-mc_list; da != NULL; da = da-next) { + da = from-mc_list; + while (da != NULL) { + next = da-next; if (!da-da_synced) { err = __dev_addr_add(to-mc_list, to-mc_count, da-da_addr, da-da_addrlen, 0); @@ -134,6 +136,7 @@ int dev_mc_sync(struct net_device *to, s __dev_addr_delete(from-mc_list, from-mc_count, da-da_addr, da-da_addrlen, 0); } + da = next; } if (!err) __dev_set_rx_mode(to); @@ -156,12 +159,14 @@ EXPORT_SYMBOL(dev_mc_sync); */ void dev_mc_unsync(struct net_device *to, struct net_device *from) { - struct dev_addr_list *da; + struct dev_addr_list *da, next; netif_tx_lock_bh(from); netif_tx_lock_bh(to); - for (da = from-mc_list; da != NULL; da = da-next) { + da = from-mc_list; + while (da != NULL) { + next = da-next; if (!da-da_synced) continue; __dev_addr_delete(to-mc_list, to-mc_count, @@ -169,6 +174,7 @@ void dev_mc_unsync(struct net_device *to da-da_synced = 0; __dev_addr_delete(from-mc_list, from-mc_count, da-da_addr, da-da_addrlen, 0); + da = next; } __dev_set_rx_mode(to); -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] restore netdev_priv optimization
Hi,

David Miller wrote:
> From: Stephen Hemminger [EMAIL PROTECTED]
> Date: Fri, 17 Aug 2007 15:40:22 -0700
>
>> Compile tested only!!!
>
> Obviously. The first loopback transmit is guaranteed to crash.

[...]

> And this also breaks loopback again, which uses a static struct netdev
> in the kernel image, it doesn't use alloc_netdev(), so egress_subqueue
> of loopback will be NULL.

Talking about loopback, don't you think it could be the right time to
make it behave like any other kind of net device, and allocate it
dynamically?

Having a dynamically allocated loopback could make maintenance easier
(removing special cases). It is also something we'll need in order to
support multiple loopbacks, for example for network namespaces.

Eric Biederman has written a nice patch that does this. I'm using it on
2.6.23-rc2.

Benjamin

--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com
L2 network namespaces + macvlan performances
Following a discussion we had at OLS concerning L2 network namespace
performance and how the new macvlan driver could potentially improve
it, I've ported the macvlan patchset on top of Eric's net namespace
patchset on 2.6.22-rc4-mm2.

A little bit of history:

Some months ago, when we ran some performance tests (using netperf)
on net namespace, we observed the following things:

Using 'etun', the virtual ethernet tunnel driver, and IP routes
from inside a network namespace,

- The throughput is the same as the normal case(*)
  (* normal case: no namespace, using physical adapters).
  No regression. Good.

- But the CPU load increases a lot. Bad.
  The reasons are:
	- All checksums are done in software. No hardware offloading.
	- Every TCP packet going through the etun devices is
	  duplicated in ip_forward() before we decrease the ttl
	  (packets are routed between both ends of etun).

We also did some testing with bridges, and obtained the same results:
CPU load increase:
	- No hardware offloading
	- Packets are duplicated somewhere in the bridge+netfilter
	  code (can't remember where right now)

This time, I've replaced the etun interface with the new macvlan,
which should benefit from the hardware offloading capabilities of the
physical adapter and suppress the forwarding stuff.

My test setup is:

[ASCII diagram lost in extraction. Recoverable labels: Host A, Host B,
Netns 1, macvlan0, eth0 (192.168.0.2), eth0 (192.168.0.1),
macvlan0 (192.168.0.3)]

- netperf runs on host A
- netserver runs on host B
- Adapter speed is 1 Gbit/s

On this setup I ran the following netperf tests: TCP_STREAM, TCP_MAERTS,
TCP_RR, UDP_STREAM, UDP_RR.

Between the "normal" case and the "net namespace + macvlan" case,
results are about the same for both the throughput and the local CPU
load for the following test types: TCP_MAERTS, TCP_RR, UDP_STREAM, UDP_RR.

macvlan looks like a very good candidate for network namespace in these
cases.

But, with the TCP_STREAM test, I observed that the CPU load is about the
same (that's what we wanted) but the throughput decreases by about 5%:
from 850 Mbit/s down to 810 Mbit/s.
I haven't investigated yet why the throughput decreases in this case.
Does it come from my setup, from macvlan's additional processing, or
from something else? I don't know yet.

Attached to this email you'll find the raw netperf outputs for the
three cases:

- netperf through a physical adapter, no namespace:
	netperf-results-2.6.22-rc4-mm2-netns1-vanilla.txt
- netperf through etun, inside a namespace:
	netperf-results-2.6.22-rc4-mm2-netns1-using-etun.txt
- netperf through macvlan, inside a namespace:
	netperf-results-2.6.22-rc4-mm2-netns1-using-macvlan.txt

macvlan looks promising.

Regards,
Benjamin

--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com

NETPERF RESULTS: the "normal" case:
No network namespace, traffic goes through real 1 Gbit/s physical adapters.

TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.76.1 (192.168.76.1) port 0 AF_INET : +/-2.5% @ 95% conf.
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384   1400    20.03       857.39   6.39     9.75     2.444   3.727

TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.76.1 (192.168.76.1) port 0 AF_INET : +/-2.5% @ 95% conf.
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  87380    20.03       763.15   4.75     10.33    2.038   4.434

TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.76.1 (192.168.76.1) port 0 AF_INET : +/-2.5% @ 95% conf.
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.
Re: [Devel] Re: [PATCH] Virtual ethernet tunnel
David Miller wrote:
> From: Kirill Korotaev [EMAIL PROTECTED]
> Date: Thu, 07 Jun 2007 12:14:29 +0400
>
>> David Miller wrote:
>>> From: Pavel Emelianov [EMAIL PROTECTED]
>>> Date: Wed, 06 Jun 2007 19:11:38 +0400
>>>
>>>> Veth stands for Virtual ETHernet. It is a simple tunnel driver
>>>> that works at the link layer and looks like a pair of ethernet
>>>> devices interconnected with each other.
>>>
>>> I would suggest choosing a different name.
>>>
>>> 'veth' is also the name of the virtualized ethernet device
>>> found on IBM machines, driven by driver/net/ibmveth.[ch]
>>
>> AFAICS, ibmveth.c registers ethX devices, while this driver registers
>> vethX by default, so there is not much conflict IMHO.
>
> If that's the case, veth is fine with me.

I like Daniel's proposals with the tunnel or pipe thing in the name. I think it is more explicit about what the device really is.

I'm currently using etun, Eric Biederman's implementation. It will be nice to have this kind of device merged.

-- Benjamin

--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com
Re: L2 network namespace benchmarking (resend with Service Demand)
Eric W. Biederman wrote:
> Daniel Lezcano [EMAIL PROTECTED] writes:
>
>> Hi,
>>
>> as Rick suggested, I added the Service Demand results to the matrix.
>
> A couple of random thoughts in trying to understand the numbers you are
> seeing.
>
> - Checksum offloading?
>
>   You have noted that with the bridge netfilter support disabled you
>   are still seeing additional checksum overhead. Just like you are
>   seeing in the routing case. Is it possible the problem is simply that
>   etun doesn't support checksum offloading, while your normal test
>   hardware does?

Looks like you are 100% correct. I feel a bit stupid that I didn't think about this small difference between a real NIC and etun.

If I turn off checksum offloading on my physical NIC, the checksum overhead (load) measured by oprofile is about the same in both cases: when running netperf through a real NIC or through an etun tunnel.

Benjamin

> - Tagged VLANs?
>
>   Currently you have tested bridging and routing to get the packets to a
>   network namespace. Could you test tagged vlans? I'm just curious if
>   we have anything in the network stack today that will multiplex a NIC
>   without measurable overhead.
>
> - Without NETNS?
>
>   We should probably see if we can set up the same configuration we are
>   testing without network namespaces (just multiple interfaces on the
>   same machine) and see if we can still measure the same overhead. Just
>   to confirm the overhead is not a network namespace related thing.
>
>   I know we can configure the same case with bridging and I am fairly
>   confident that we will see the same overhead without network
>   namespaces.
>
>   Off the top of my head I am insufficiently clever to think how we
>   could configure the routing case without network namespaces, although
>   we might be able to force it and if so it would be interesting to
>   measure.
>
> I will work to get the etun setup races fixed and to fix whatever
> obvious feature deficiencies it has (like no configurable MTU support)
> and see if I can get that pushed upstream. That should make it easier
> for other people to reproduce what we are seeing.
>
> Eric
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.linux-foundation.org/mailman/listinfo/containers

--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com
Re: L2 network namespace benchmarking
Eric W. Biederman wrote:
> Daniel Lezcano [EMAIL PROTECTED] writes:

[...]

>> * When do you expect to have the network namespace into mainline ?
>
> My current goal is to finish my rebase against 2.6.linus_lastest in the
> next couple of days after having figured out how to deal with sysfs.

Great news! I also have some questions about this updated version:

- Have you integrated the bug fixes and cleanups(*) Daniel wrote for your previous netns patchset (and the few glitches I reported too)?
  (*) available in the LXC8 patchset
- Do you already have a public git repository set up for the rebase?
- If it is private, any plan to make it public soon? (That would be great)

> I have been reviewing more code than I know what to do with, and
> fighting some very strange bugs during the stabilization window. Which
> has kept me from doing additional development. Plus I have had a cold.

I hope you're getting better... and that you'll be able to provide us with the updated patchset very soon :)

[...]

> If I read the results right it took a 32bit machine from AMD with
> a gigabit interface before you could measure a throughput difference.
> That isn't shabby for a non-optimized code path.

Indeed the throughput difference is not significant. It is very good to see that it stays constant when using the container.

What I'm more worried about is the CPU load increase. But it seems we've identified some of the culprits.

This afternoon I had a look at why the bridge setup isn't better than the route setup (sections 2.3 and 2.4 of Daniel's report).

In the bridge case, we encounter the same problems as in the route case. The oprofile profile is the same: the most demanding routines are pskb_expand_head and csum_partial_copy_generic.

pskb_expand_head() is again called by skb_cow(), but this time skb_cow() is called by netfilter's nf_bridge_copy_header(). We can avoid this copy by disabling the CONFIG_BRIDGE_NETFILTER option.

This copy is made even if netfilter is not used on the host. Maybe some optimizations can be made in netfilter's code to prevent this.

Regards,
Benjamin

--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com
Re: [PATCH RFC 17/31] net: Factor out __dev_alloc_name from dev_alloc_name
Hello Eric,

See comments about __dev_alloc_name() below.

Regards,
Benjamin

Eric W. Biederman wrote:
> From: Eric W. Biederman [EMAIL PROTECTED] - unquoted
>
> When forcibly changing the network namespace of a device
> I need something that can generate a name for the device
> in the new namespace without overwriting the old name.
>
> __dev_alloc_name provides me that functionality.
>
> Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
> ---
>  net/core/dev.c | 44 +---
>  1 files changed, 33 insertions(+), 11 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 32fe905..fc0d2af 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -655,9 +655,10 @@ int dev_valid_name(const char *name)
>  }
>
>  /**
> - *	dev_alloc_name - allocate a name for a device
> - *	@dev: device
> + *	__dev_alloc_name - allocate a name for a device
> + *	@net: network namespace to allocate the device name in
>   *	@name: name format string
> + *	@buf: scratch buffer and result name string
>   *
>   *	Passed a format string - eg "lt%d" it will try and find a suitable
>   *	id. It scans list of devices to build up a free map, then chooses
> @@ -668,18 +669,13 @@ int dev_valid_name(const char *name)
>   *	Returns the number of the unit assigned or a negative errno code.
>   */
>
> -int dev_alloc_name(struct net_device *dev, const char *name)
> +static int __dev_alloc_name(net_t net, const char *name, char buf[IFNAMSIZ])

IMHO the third parameter should be: char *buf

Indeed, using char buf[IFNAMSIZ] is misleading because later in the routine sizeof(buf) is used (with an expected result of IFNAMSIZ). Unfortunately this is no longer the case: the value of sizeof(buf) is now only 4 (buf is a pointer parameter). This corrupts the registration of network devices (now I understand why only one of my e1000 showed up after each reboot :).

Also, sizeof(buf) should be replaced by IFNAMSIZ in this new routine. (See below)

>  {
>  	int i = 0;
> -	char buf[IFNAMSIZ];
>  	const char *p;
>  	const int max_netdevices = 8*PAGE_SIZE;
>  	long *inuse;
>  	struct net_device *d;
> -	net_t net;
> -
> -	BUG_ON(null_net(dev->nd_net));
> -	net = dev->nd_net;
>
>  	p = strnchr(name, IFNAMSIZ-1, '%');
>  	if (p) {
> @@ -713,10 +709,8 @@ int dev_alloc_name(struct net_device *dev, const char *name)
>  	}
>
>  	snprintf(buf, sizeof(buf), name, i);

Replace with snprintf(buf, IFNAMSIZ, name, i); otherwise i will never be appended to name and all your ethernet devices will try to register the name "eth".

There is another occurrence of snprintf(buf, sizeof(buf), ...) to replace in the for loop above.

> -	if (!__dev_get_by_name(net, buf)) {
> -		strlcpy(dev->name, buf, IFNAMSIZ);
> +	if (!__dev_get_by_name(net, buf))
>  		return i;
> -	}
>
>  	/* It is possible to run out of possible slots
>  	 * when the name is long and there isn't enough space left
> @@ -725,6 +719,34 @@ int dev_alloc_name(struct net_device *dev, const char *name)
>  	return -ENFILE;
>  }
>
> +/**
> + *	dev_alloc_name - allocate a name for a device
> + *	@dev: device
> + *	@name: name format string
> + *
> + *	Passed a format string - eg "lt%d" it will try and find a suitable
> + *	id. It scans list of devices to build up a free map, then chooses
> + *	the first empty slot. The caller must hold the dev_base or rtnl lock
> + *	while allocating the name and adding the device in order to avoid
> + *	duplicates.
> + *	Limited to bits_per_byte * page size devices (ie 32K on most platforms).
> + *	Returns the number of the unit assigned or a negative errno code.
> + */
> +
> +int dev_alloc_name(struct net_device *dev, const char *name)
> +{
> +	char buf[IFNAMSIZ];
> +	net_t net;
> +	int ret;
> +
> +	BUG_ON(null_net(dev->nd_net));
> +	net = dev->nd_net;
> +	ret = __dev_alloc_name(net, name, buf);
> +	if (ret >= 0)
> +		strlcpy(dev->name, buf, IFNAMSIZ);
> +	return ret;
> +}
> +
>  /**
>   *	dev_change_name - change name of a device

--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com
Re: Lost packets after switching Wi-Fi AP
Jiri Benc wrote:
> On Mon, 30 Oct 2006 15:55:57 +0100, Benjamin Thery wrote:
>> When I switch my Mobile Node between 2 Wi-Fi Access Points, there is a
>> period of time where all the packets I send are lost, although I got
>> the netlink event SIOCGIWAP 'up' for the new AP. The device is supposed
>> to be ready, but the packets are lost.
>
> Which wireless card are you using? Which version of the kernel?

Hi Jiri,

The kernel version is 2.6.16.20 (the latest kernel version officially supported by MIPv6). I'd like to use a 2.6.19 but unfortunately not all the IPv6 mobility patches are in.

I reproduced the problem with an Intel Pro Wireless 2200 (latest driver version: 1.2.0) and a PCMCIA D-Link Airplus G+ DWL-G650+ using ndiswrapper version 1.25.

But I'm not sure the problem is wireless-specific.

And as I wrote in my first message, I'm also surprised that when noop_enqueue() is used, the return code is NET_XMIT_CN, whereas the packet seems to be dropped.

Thanks for your help.
Benjamin

> Jiri

--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com
Lost packets after switching Wi-Fi AP
Hello,

I work on an extension of the mobility protocol for IPv6 (FMIPv6-RFC4068) and I've noticed the following problem:

When I switch my Mobile Node between 2 Wi-Fi Access Points, there is a period of time where all the packets I send are lost, although I got the netlink event SIOCGIWAP 'up' for the new AP. The device is supposed to be ready, but the packets are lost.

By 'lost', I mean: silently discarded by the kernel, no error returned to the user-space application sending the packet, and the packets never appear on the network monitored with wireshark.

Here is the setup:
------------------

1. The daemon decides to switch from one AP to the other for some reason (better link quality, ...) and sets the new ESSID, etc.
2. The daemon waits for the SIOCGIWAP 'up' netlink event.
3. SIOCGIWAP received: the daemon sends a unique Mobility Header packet using a raw socket to its new router to signal it has successfully moved. sendmsg() returned 0, no error, but the packet never shows up.

- The interface has an IPv6 address configured for the new network (previously created).
- There is a route between the node and the router.
- I set the socket option IPV6_RECVERR to get all the errors, but none shows up.
- The black hole period lasts for about 500ms after the SIOCGIWAP event. Every packet sent during this period is lost.
- I tried to get the interface status before sending the packet (ioctl(SIOCGIFFLAGS)) but I got a perfect IFF_UP|IFF_RUNNING.

What I've found in the kernel:
------------------------------

- The packet is discarded in the packet scheduler in net/sched/sch_generic.c::noop_enqueue(), which returns NET_XMIT_CN.
- The error doesn't go up to the application because net/ipv6/ip6_output.c::ip6_push_pending_frames() filters this type of error (using net_xmit_errno(err)).
- noop_enqueue() is used to enqueue the packet because the device has been deactivated by link_watch_run_queue() calling dev_deactivate(). The device is re-activated about 500ms later.
- According to net/sched/sch_api.c, NET_XMIT_CN means "probably this packet enqueued, but another one dropped". But it seems to me that this packet IS actually dropped in noop_enqueue() (kfree_skb()). Shouldn't the routine return NET_XMIT_DROP instead? Then the application would be able to get the error code when the device is deactivated and retry later.

My questions:
-------------

- Am I doing something obviously wrong? Is there another event I should expect before sending my packet? An event that signals more reliably that the link is up and running and associated with the new AP?
- Shouldn't we change the return code in noop_enqueue()?

Thanks a lot for your help,
Benjamin

--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com