64bit counters: iproute(netlink) vs ifconfig
Hello,

Just discovered that counters returned by the ip tool are truncated:

# ip -s link show bond0
1: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether 00:1d:09:67:6e:2f brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    2485605521 9010211  0       0       0       6
    TX: bytes  packets  errors  dropped carrier collsns
    3023237974 9345397  0       0       0       0

# ifconfig bond0
bond0     Link encap:Ethernet  HWaddr 00:1D:09:67:6E:2F
          inet addr:192.168.152.62  Bcast:192.168.152.255  Mask:255.255.255.0
          inet6 addr: fe80::21d:9ff:fe67:6e2f/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:9010367 errors:0 dropped:351 overruns:0 frame:0
          TX packets:9345521 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2485631020 (2370.4 Mb)  TX bytes:7318232294 (6979.2 Mb)

Is it possible to get 64-bit counters in ip via netlink? Struct rtnl_link_stats does not look very optimistic, as it has rx_bytes/tx_bytes defined with __u32.

Best regards,
Krzysztof Olędzki
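The two TX byte counts in the output above are consistent with a 32-bit wrap: reducing the ifconfig value modulo 2^32 lands just above the value ip printed a moment earlier. A minimal sketch of that arithmetic follows; the helper name `truncate_u32` is made up for illustration, and the numbers are the ones from the output above.

```python
U32_MODULUS = 2 ** 32  # a __u32 field can hold values 0 .. 2**32 - 1


def truncate_u32(value: int) -> int:
    """Return what a larger counter looks like when stored in a 32-bit field."""
    return value % U32_MODULUS


ifconfig_tx_bytes = 7318232294  # TX bytes reported by ifconfig
ip_tx_bytes = 3023237974        # TX bytes reported by ip shortly before

wrapped = truncate_u32(ifconfig_tx_bytes)
print(wrapped)                  # 3023264998
print(wrapped - ip_tx_bytes)    # 27024, the gap between the two readings
```

The RX byte count (2485631020) is still below 2^32, which would explain why only the TX values differ so visibly.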
Re: [PATCH] ipvs: Make the synchronization interval controllable
On Fri, 8 Feb 2008, Andi Kleen wrote:

> Sven Wegener [EMAIL PROTECTED] writes:
>
>> The default synchronization interval of 1000 milliseconds is too high for a heavily loaded director. Collecting the connection information from one second and then sending it out in a burst will overflow the socket buffer and lead to synchronization information being dropped. Make the interval controllable by a sysctl variable so that users can tune it.
>
> It would be better if the defaults just worked under all circumstances. So why not just lower the default?
>
> Or the code could detect overflowing socket buffers and lower the value dynamically.

We can also start sending when the amount of data reaches a defined level.

Best regards,
Krzysztof Olędzki
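The last suggestion, sending as soon as the queued data reaches a defined level instead of waiting for the full interval, could look roughly like the sketch below. It is an illustration only and not IPVS code: the class `SyncBuffer`, its parameters and the `send` callback are hypothetical names.

```python
import time
from typing import Callable, List


class SyncBuffer:
    """Queue sync messages; flush on a size threshold or after an interval.

    Hypothetical illustration, not taken from the IPVS sync code.
    """

    def __init__(self, send: Callable[[List[bytes]], None],
                 max_bytes: int, interval_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send
        self._max_bytes = max_bytes
        self._interval_s = interval_s
        self._clock = clock
        self._pending: List[bytes] = []
        self._pending_bytes = 0
        self._last_flush = clock()

    def add(self, msg: bytes) -> None:
        self._pending.append(msg)
        self._pending_bytes += len(msg)
        # Flush early once enough data has piled up, so that one large
        # burst is replaced by several smaller ones.
        if self._pending_bytes >= self._max_bytes:
            self.flush()

    def tick(self) -> None:
        # Called periodically: flush leftovers once the interval has expired.
        if self._pending and self._clock() - self._last_flush >= self._interval_s:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self._send(self._pending)
        self._pending = []
        self._pending_bytes = 0
        self._last_flush = self._clock()


sent: List[List[bytes]] = []
buf = SyncBuffer(send=sent.append, max_bytes=8, interval_s=1.0)
buf.add(b"abcd")   # 4 bytes queued, below the threshold
buf.add(b"efgh")   # 8 bytes queued, threshold reached, batch is sent
print(len(sent))   # 1
```

The interval stays as an upper bound on latency for lightly loaded systems, while the byte threshold bounds the size of any single burst.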
Re: [patch for 2.6.24? 1/1] bonding: locking fix
On Thu, 17 Jan 2008, Jay Vosburgh wrote:

> Krzysztof Oledzki [EMAIL PROTECTED] wrote:
>
>>>> Andrew Morton [EMAIL PROTECTED] wrote:
>>>> [...]
>>>> Can we get this bug fixed please? Today? It has been known about for more than two months.
>>>
>>> I just reposted the complete fix; it's #1 of the series of 7.
>>
>> Bad news. :(
>>
>> 2.6.24-rc7 + patch #1 (bonding: fix locking in sysfs primary/active selection):
>> [...]
>> =========================================================
>> [ INFO: possible irq lock inversion dependency detected ]
>> 2.6.24-rc7 #1
>> ---------------------------------------------------------
>> events/0/9 just changed the state of lock:
>>  (&mc->mca_lock){-+..}, at: [c041255a] mld_ifc_timer_expire+0x130/0x1fb
>> but this lock took another, soft-read-irq-unsafe lock in the past:
>>  (&bond->lock){-.--}
>
> None of the seven patches I posted just a bit ago will fix this lockdep warning (which is a different thing than the bug Andrew inquired about); I'm still working on that one.
>
> For that one, I had posted this work in progress patch:

Yes, this one works.

> which makes the warning go away, but Herbert Xu pointed out that there is a potential problem with bond_enslave accessing the mc_lists without sufficient locking. It's not the only offender, either, and the bond->mc_list references really need to be protected by the bond_lock, and the whole thing probably ought to use dev_mc_sync/unsync instead of what it does now.
>
> Since the bond_enslave, et al, business isn't a new problem, and I've never heard of it being hit, I'm thinking now to just leave the bond_enslave part for 2.6.25, and fix the lockdep warning for 2.6.24.

It is a new problem, as it never happened with <=2.6.23.

Best regards,
Krzysztof Olędzki
Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24
On Sat, 12 Jan 2008, Jay Vosburgh wrote: Krzysztof Oledzki [EMAIL PROTECTED] wrote: [...] Exactly. All I need to do is to reboot my server, I have 100% probability to get the warning. I wish it were that easy for me; I'm not sure what magic thing you've got on your server or network that I don't, but I haven't been able to make this lockdep warning happen at all. Right. So, what is the final patch? I would like to test it if that's possible. ;) Can you test the following and let me know if it triggers the warning? I believe this is the minimum locking needed, and based on input from Herbert, we shouldn't need to hold the lock at _bh. If this one works, and nobody sees any other issues with it, then it's the final patch for this lockdep problem. I'll add some deep, meaningful comments to explain the locking a bit (i.e., we're called with rtnl for the allmulti and promisc cases, so we're ok there without additional locks, but the later code could be called from anywhere, so it needs locks to prevent the slave list from changing, but the mc_lists themselves are covered by the netif_tx_lock that all callers will hold), but this would be the actual code change. 
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 77d004d..6906dbc 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3937,8 +3937,6 @@ static void bond_set_multicast_list(struct net_device *bond_dev)
 	struct bonding *bond = bond_dev->priv;
 	struct dev_mc_list *dmi;
 
-	write_lock_bh(&bond->lock);
-
 	/*
 	 * Do promisc before checking multicast_mode
 	 */
@@ -3959,6 +3957,8 @@ static void bond_set_multicast_list(struct net_device *bond_dev)
 		bond_set_allmulti(bond, -1);
 	}
 
+	read_lock(&bond->lock);
+
 	bond->flags = bond_dev->flags;
 
 	/* looking for addresses to add to slaves' mc list */
@@ -3979,7 +3979,7 @@ static void bond_set_multicast_list(struct net_device *bond_dev)
 	bond_mc_list_destroy(bond);
 	bond_mc_list_copy(bond_dev->mc_list, bond, GFP_ATOMIC);
 
-	write_unlock_bh(&bond->lock);
+	read_unlock(&bond->lock);
 }
 
 /*

I can confirm that the warning went away.

Tested-by: Krzysztof Piotr Oledzki [EMAIL PROTECTED]

Best regards,
Krzysztof Olędzki
Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24
On Wed, 9 Jan 2008, Andy Gospodarek wrote:

> On Wed, Jan 09, 2008 at 09:54:56AM -0800, Jay Vosburgh wrote:

CUT

>> This should silence the lockdep (if I'm understanding what everybody's saying), and keep the change set to a minimum. This might not even be worth pushing for 2.6.24; I'm not exactly sure how difficult the lockdep problem would be to trigger.
>
> The lockdep problem is easy to trigger. The lockdep code does a good job of noticing problems quickly regardless of how easy the deadlocks are to create.

Exactly. All I need to do is reboot my server; I have a 100% probability of getting the warning.

> I'd like to see it go in there (for correctness) and to avoid hearing about these lockdep issues for the next few months until it makes it into 2.6.25.

Right. So, what is the final patch? I would like to test it if that's possible. ;)

Best regards,
Krzysztof Olędzki
Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24
On Tue, 8 Jan 2008, Jay Vosburgh wrote:

> Krzysztof Oledzki [EMAIL PROTECTED] wrote:
>
>> Fine. Just letting you know that someone tests your patches and everything works, except for the mentioned problem.
>
> And I appreciate it; I just wanted to make sure our many fans following along at home didn't misunderstand.
>
> Could you let me know if the patch below makes the lockdep warning go away? This applies on top of the previous three, although it should be trivial to do by hand.
>
> I'm still checking to make sure this is safe with regard to mutexing the bonding structures, but it would be good to know if it eliminates the warning.

I can confirm that the warning went away.

Tested-by: Krzysztof Piotr Oledzki [EMAIL PROTECTED]

Best regards,
Krzysztof Olędzki
Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24
On Mon, 7 Jan 2008, Jay Vosburgh wrote:

> Following are three fixes to fix locking problems and silence locking-related warnings in the current 2.6.24-rc.
>
> patch 1: fix locking in sysfs primary/active selection
>
> Call core network functions with expected locks to eliminate potential deadlock and silence warnings.
>
> patch 2: fix ASSERT_RTNL that produces spurious warnings
>
> Relocate ASSERT_RTNL to remove a false warning; after patch, ASSERT is located in code that holds only RTNL (additional locks were causing the ASSERT to trip)
>
> patch 3: fix locking during alb failover and slave removal
>
> Fix all call paths into alb_fasten_mac_swap to hold only RTNL. Eliminates deadlock and silences warnings.
>
> Patches are against the current netdev-2.6#upstream branch.
>
> Please apply for 2.6.24.

2.6.24-rc7 + patches #1, #2, #3:

bonding: bond0: setting mode to active-backup (1).
bonding: bond0: Setting MII monitoring interval to 100.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: Adding slave eth0.
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: first active interface up!
bonding: bond0: enslaving eth0 as an active interface with an up link.
bonding: bond0: Adding slave eth1.
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.24-rc7 #1
---------------------------------------------------------
events/0/9 just changed the state of lock:
 (&mc->mca_lock){-+..}, at: [c041258e] mld_ifc_timer_expire+0x130/0x1fb
but this lock took another, soft-read-irq-unsafe lock in the past:
 (&bond->lock){-.--}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
4 locks held by events/0/9:
 #0: (events){--..}, at: [c0133d33] run_workqueue+0x87/0x1b6
 #1: ((linkwatch_work).work){--..}, at: [c0133d33] run_workqueue+0x87/0x1b6
 #2: (rtnl_mutex){--..}, at: [c03ac678] linkwatch_event+0x5/0x22
 #3: (&ndev->lock){-.-+}, at: [c0412475] mld_ifc_timer_expire+0x17/0x1fb

the first lock's dependencies:
-> (&mc->mca_lock){-+..} ops: 10 {
   initial-use at:
     [c0104ee2] dump_trace+0x83/0x8d
     [c0142890] __lock_acquire+0x4ba/0xc07
     [c0109ef2] save_stack_trace+0x20/0x3a
     [c0142f95] __lock_acquire+0xbbf/0xc07
     [c0412d66] ipv6_dev_mc_inc+0x24d/0x31c
     [c0143056] lock_acquire+0x79/0x93
     [c04129ea] igmp6_group_added+0x18/0x11d
     [c043a8aa] _spin_lock_bh+0x3b/0x64
     [c04129ea] igmp6_group_added+0x18/0x11d
     [c04129ea] igmp6_group_added+0x18/0x11d
     [c0141f93] trace_hardirqs_on+0x122/0x14c
     [c0412dbc] ipv6_dev_mc_inc+0x2a3/0x31c
     [c0412d66] ipv6_dev_mc_inc+0x24d/0x31c
     [c0412df1] ipv6_dev_mc_inc+0x2d8/0x31c
     [c0412b19] ipv6_dev_mc_inc+0x0/0x31c
     [c0402168] ipv6_add_dev+0x21c/0x24b
     [c040b991] ndisc_ifinfo_sysctl_change+0x0/0x1ef
     [c05c5ae9] addrconf_init+0x13/0x193
     [c019a04b] proc_net_fops_create+0x10/0x21
     [c041a44c] ip6_flowlabel_init+0x1e/0x20
     [c05c59c9] inet6_init+0x1f0/0x2ad
     [c05a9499] kernel_init+0x150/0x2b7
     [c05a9349] kernel_init+0x0/0x2b7
     [c05a9349] kernel_init+0x0/0x2b7
     [c0104baf] kernel_thread_helper+0x7/0x10
     [] 0x
   in-softirq-W at:
     [c014197a] mark_lock+0x64/0x451
     [c0142816] __lock_acquire+0x440/0xc07
     [c0103f7b] restore_nocheck+0x12/0x15
     [c0143056] lock_acquire+0x79/0x93
     [c041258e] mld_ifc_timer_expire+0x130/0x1fb
     [c041245e] mld_ifc_timer_expire+0x0/0x1fb
     [c043a8aa] _spin_lock_bh+0x3b/0x64
     [c041258e] mld_ifc_timer_expire+0x130/0x1fb
     [c041258e] mld_ifc_timer_expire+0x130/0x1fb
     [c041245e] mld_ifc_timer_expire+0x0/0x1fb
     [c0141f7d] trace_hardirqs_on+0x10c/0x14c
     [c041245e] mld_ifc_timer_expire+0x0/0x1fb
     [c012e02e] run_timer_softirq+0xfa/0x15d
     [c012a982] __do_softirq+0x56/0xdb
     [c0141f7d] trace_hardirqs_on+0x10c/0x14c
     [c012a994] __do_softirq+0x68/0xdb
     [c012aa3d] do_softirq+0x36/0x51
Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24
On Tue, 8 Jan 2008, Jay Vosburgh wrote:

> Krzysztof Oledzki [EMAIL PROTECTED] wrote:
>
>> On Mon, 7 Jan 2008, Jay Vosburgh wrote:
>>
>>> Following are three fixes to fix locking problems and silence locking-related warnings in the current 2.6.24-rc.
>>>
>>> patch 1: fix locking in sysfs primary/active selection
>>>
>>> Call core network functions with expected locks to eliminate potential deadlock and silence warnings.
>>>
>>> patch 2: fix ASSERT_RTNL that produces spurious warnings
>>>
>>> Relocate ASSERT_RTNL to remove a false warning; after patch, ASSERT is located in code that holds only RTNL (additional locks were causing the ASSERT to trip)
>>>
>>> patch 3: fix locking during alb failover and slave removal
>>>
>>> Fix all call paths into alb_fasten_mac_swap to hold only RTNL. Eliminates deadlock and silences warnings.
>>>
>>> Patches are against the current netdev-2.6#upstream branch.
>>>
>>> Please apply for 2.6.24.
>>
>> 2.6.24-rc7 + patches #1, #2, #3:
>>
>> bonding: bond0: setting mode to active-backup (1).
>> bonding: bond0: Setting MII monitoring interval to 100.
>> ADDRCONF(NETDEV_UP): bond0: link is not ready
>> bonding: bond0: Adding slave eth0.
>> e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
>> bonding: bond0: making interface eth0 the new active one.
>> bonding: bond0: first active interface up!
>> bonding: bond0: enslaving eth0 as an active interface with an up link.
>> bonding: bond0: Adding slave eth1.
>> ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
>>
>> =========================================================
>> [ INFO: possible irq lock inversion dependency detected ]
>> 2.6.24-rc7 #1
>> ---------------------------------------------------------
>> events/0/9 just changed the state of lock:
>>  (&mc->mca_lock){-+..}, at: [c041258e] mld_ifc_timer_expire+0x130/0x1fb
>> but this lock took another, soft-read-irq-unsafe lock in the past:
>>  (&bond->lock){-.--}
>>
>> and interrupts could create inverse lock ordering between them.
>
> Just to be clear: the patch set I posted yesterday was not intended to resolve the lockdep problem; I haven't studied that one yet.

Fine. Just letting you know that someone tests your patches and everything works, except for the mentioned problem.

Best regards,
Krzysztof Olędzki
Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)
On Wed, 19 Dec 2007, Andy Gospodarek wrote:
On Tue, Dec 18, 2007 at 08:53:39PM +0100, Krzysztof Oledzki wrote:
On Fri, 14 Dec 2007, Andy Gospodarek wrote:
On Fri, Dec 14, 2007 at 07:57:42PM +0100, Krzysztof Oledzki wrote:
On Fri, 14 Dec 2007, Andy Gospodarek wrote:
On Fri, Dec 14, 2007 at 05:14:57PM +0100, Krzysztof Oledzki wrote:
On Wed, 12 Dec 2007, Jay Vosburgh wrote:
Herbert Xu [EMAIL PROTECTED] wrote:

diff -puN drivers/net/bonding/bond_sysfs.c~bonding-locking-fix drivers/net/bonding/bond_sysfs.c
--- a/drivers/net/bonding/bond_sysfs.c~bonding-locking-fix
+++ a/drivers/net/bonding/bond_sysfs.c
@@ -,8 +,6 @@ static ssize_t bonding_store_primary(str
 out:
 	write_unlock_bh(&bond->lock);
-	rtnl_unlock();
-

Looking at the changeset that added this perhaps the intention is to hold the lock? If so we should add an rtnl_lock to the start of the function.

Yes, this function needs to hold locks, and more than just what's there now. I believe the following should be correct; I haven't tested it, though (I'm supposedly on vacation right now).

The following change should be correct for the bonding_store_primary case discussed in this thread, and also corrects the bonding_store_active case which performs similar functions.

The bond_change_active_slave and bond_select_active_slave functions both require rtnl, bond->lock for read and curr_slave_lock for write_bh, and no other locks. This is so that the lower level mode-specific functions can release locks down to just rtnl in order to call, e.g., dev_set_mac_address with the locks it expects (rtnl only).

Signed-off-by: Jay Vosburgh [EMAIL PROTECTED]

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 11b76b3..28a2d80 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d,
 	struct slave *slave;
 	struct bonding *bond = to_bond(d);
 
-	write_lock_bh(&bond->lock);
+	rtnl_lock();
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
+
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
 		       ": %s: Unable to set primary slave; %s is in mode %d\n",
@@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d,
 		}
 	}
 out:
-	write_unlock_bh(&bond->lock);
-
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
 	rtnl_unlock();
 
 	return count;
@@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct device *d,
 	struct bonding *bond = to_bond(d);
 
 	rtnl_lock();
-	write_lock_bh(&bond->lock);
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
 
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
@@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct device *d,
 		}
 	}
 out:
-	write_unlock_bh(&bond->lock);
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
 	rtnl_unlock();
 
 	return count;

Vanilla 2.6.24-rc5 plus this patch:

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.24-rc5 #1
---------------------------------------------------------
events/0/9 just changed the state of lock:
 (&mc->mca_lock){-+..}, at: [c0411c7a] mld_ifc_timer_expire+0x130/0x1fb
but this lock took another, soft-read-irq-unsafe lock in the past:
 (&bond->lock){-.--}

and interrupts could create inverse lock ordering between them.

Grrr, I should have seen that -- sorry. Try your luck with this instead:

CUT

No luck.

I'm guessing if we go back to using a write-lock for bond->lock this will go back to working again, but I'm not totally convinced since there are plenty of places where we used a read-lock with it.

Should I check this patch or rather, based on a future discussion, wait for another version?

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 11b76b3..635b857 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d,
 	struct slave *slave;
 	struct bonding *bond = to_bond(d);
 
+	rtnl_lock();
 	write_lock_bh(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
+
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
 		       ": %s: Unable to set primary slave; %s is in mode %d\n",
@@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d,
 		}
 	}
 out:
+	write_unlock_bh(&bond->curr_slave_lock);
 	write_unlock_bh(&bond->lock);
-
 	rtnl_unlock();
 
 	return count
Re: [PATCH] sky2: Use deferrable timer for watchdog
On Thu, 20 Dec 2007, Parag Warudkar wrote:

> On Dec 20, 2007 2:22 PM, Kok, Auke [EMAIL PROTECTED] wrote:
>
>> ok, that's just bad and if there's no user-definable limit to the deferral I definitely don't like this change. Can I safely assume that any irq will cause all deferred timers to run?
>
> I think even other causes for wakeup like process related ones will cause the CPU to go busy and run the timers. This, coupled with the fact that no one is yet able to reach 0 wakeups per second makes it pretty unlikely that deferrable timers will be deferred indefinitely.
>
>> If this is the case then for e1000 this patch is still OK since the watchdog needs to run (1) after a link up/down interrupt or (2) to update statistics. Those statistics won't increase if there is no traffic of course...
>
> I think it is reasonable for Network driver watchdogs to use a deferrable timer - if the machine is 100% IDLE there is no one needing the network to be up.

Please note that being connected to a network does not only mean to send but also to receive.

Best regards,
Krzysztof Oledzki
Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)
On Fri, 14 Dec 2007, Andy Gospodarek wrote:
On Fri, Dec 14, 2007 at 07:57:42PM +0100, Krzysztof Oledzki wrote:
On Fri, 14 Dec 2007, Andy Gospodarek wrote:
On Fri, Dec 14, 2007 at 05:14:57PM +0100, Krzysztof Oledzki wrote:
On Wed, 12 Dec 2007, Jay Vosburgh wrote:
Herbert Xu [EMAIL PROTECTED] wrote:

diff -puN drivers/net/bonding/bond_sysfs.c~bonding-locking-fix drivers/net/bonding/bond_sysfs.c
--- a/drivers/net/bonding/bond_sysfs.c~bonding-locking-fix
+++ a/drivers/net/bonding/bond_sysfs.c
@@ -,8 +,6 @@ static ssize_t bonding_store_primary(str
 out:
 	write_unlock_bh(&bond->lock);
-	rtnl_unlock();
-

Looking at the changeset that added this perhaps the intention is to hold the lock? If so we should add an rtnl_lock to the start of the function.

Yes, this function needs to hold locks, and more than just what's there now. I believe the following should be correct; I haven't tested it, though (I'm supposedly on vacation right now).

The following change should be correct for the bonding_store_primary case discussed in this thread, and also corrects the bonding_store_active case which performs similar functions.

The bond_change_active_slave and bond_select_active_slave functions both require rtnl, bond->lock for read and curr_slave_lock for write_bh, and no other locks. This is so that the lower level mode-specific functions can release locks down to just rtnl in order to call, e.g., dev_set_mac_address with the locks it expects (rtnl only).

Signed-off-by: Jay Vosburgh [EMAIL PROTECTED]

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 11b76b3..28a2d80 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d,
 	struct slave *slave;
 	struct bonding *bond = to_bond(d);
 
-	write_lock_bh(&bond->lock);
+	rtnl_lock();
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
+
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
 		       ": %s: Unable to set primary slave; %s is in mode %d\n",
@@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d,
 		}
 	}
 out:
-	write_unlock_bh(&bond->lock);
-
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
 	rtnl_unlock();
 
 	return count;
@@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct device *d,
 	struct bonding *bond = to_bond(d);
 
 	rtnl_lock();
-	write_lock_bh(&bond->lock);
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
 
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
@@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct device *d,
 		}
 	}
 out:
-	write_unlock_bh(&bond->lock);
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
 	rtnl_unlock();
 
 	return count;

Vanilla 2.6.24-rc5 plus this patch:

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.24-rc5 #1
---------------------------------------------------------
events/0/9 just changed the state of lock:
 (&mc->mca_lock){-+..}, at: [c0411c7a] mld_ifc_timer_expire+0x130/0x1fb
but this lock took another, soft-read-irq-unsafe lock in the past:
 (&bond->lock){-.--}

and interrupts could create inverse lock ordering between them.

Grrr, I should have seen that -- sorry. Try your luck with this instead:

CUT

No luck.

I'm guessing if we go back to using a write-lock for bond->lock this will go back to working again, but I'm not totally convinced since there are plenty of places where we used a read-lock with it.

Should I check this patch or rather, based on a future discussion, wait for another version?

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 11b76b3..635b857 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d,
 	struct slave *slave;
 	struct bonding *bond = to_bond(d);
 
+	rtnl_lock();
 	write_lock_bh(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
+
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
 		       ": %s: Unable to set primary slave; %s is in mode %d\n",
@@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d,
 		}
 	}
 out:
+	write_unlock_bh(&bond->curr_slave_lock);
 	write_unlock_bh(&bond->lock);
-
 	rtnl_unlock();
 
 	return count;
@@ -1191,6 +1194,7 @@ static ssize_t bonding_store_active_slave(struct device *d,
 
 	rtnl_lock();
 	write_lock_bh(&bond->lock
Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)
On Wed, 12 Dec 2007, Jay Vosburgh wrote:
Herbert Xu [EMAIL PROTECTED] wrote:

diff -puN drivers/net/bonding/bond_sysfs.c~bonding-locking-fix drivers/net/bonding/bond_sysfs.c
--- a/drivers/net/bonding/bond_sysfs.c~bonding-locking-fix
+++ a/drivers/net/bonding/bond_sysfs.c
@@ -,8 +,6 @@ static ssize_t bonding_store_primary(str
 out:
 	write_unlock_bh(&bond->lock);
-	rtnl_unlock();
-

Looking at the changeset that added this perhaps the intention is to hold the lock? If so we should add an rtnl_lock to the start of the function.

Yes, this function needs to hold locks, and more than just what's there now. I believe the following should be correct; I haven't tested it, though (I'm supposedly on vacation right now).

The following change should be correct for the bonding_store_primary case discussed in this thread, and also corrects the bonding_store_active case which performs similar functions.

The bond_change_active_slave and bond_select_active_slave functions both require rtnl, bond->lock for read and curr_slave_lock for write_bh, and no other locks. This is so that the lower level mode-specific functions can release locks down to just rtnl in order to call, e.g., dev_set_mac_address with the locks it expects (rtnl only).

Signed-off-by: Jay Vosburgh [EMAIL PROTECTED]

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 11b76b3..28a2d80 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d,
 	struct slave *slave;
 	struct bonding *bond = to_bond(d);
 
-	write_lock_bh(&bond->lock);
+	rtnl_lock();
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
+
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
 		       ": %s: Unable to set primary slave; %s is in mode %d\n",
@@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d,
 		}
 	}
 out:
-	write_unlock_bh(&bond->lock);
-
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
 	rtnl_unlock();
 
 	return count;
@@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct device *d,
 	struct bonding *bond = to_bond(d);
 
 	rtnl_lock();
-	write_lock_bh(&bond->lock);
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
 
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
@@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct device *d,
 		}
 	}
 out:
-	write_unlock_bh(&bond->lock);
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
 	rtnl_unlock();
 
 	return count;

Vanilla 2.6.24-rc5 plus this patch:

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.24-rc5 #1
---------------------------------------------------------
events/0/9 just changed the state of lock:
 (&mc->mca_lock){-+..}, at: [c0411c7a] mld_ifc_timer_expire+0x130/0x1fb
but this lock took another, soft-read-irq-unsafe lock in the past:
 (&bond->lock){-.--}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
4 locks held by events/0/9:
 #0: (events){--..}, at: [c0133c57] run_workqueue+0x87/0x1b6
 #1: ((linkwatch_work).work){--..}, at: [c0133c57] run_workqueue+0x87/0x1b6
 #2: (rtnl_mutex){--..}, at: [c03abd50] linkwatch_event+0x5/0x22
 #3: (&ndev->lock){-.-+}, at: [c0411b61] mld_ifc_timer_expire+0x17/0x1fb

the first lock's dependencies:
-> (&mc->mca_lock){-+..} ops: 10 {
   initial-use at:
     [c0104ee2] dump_trace+0x83/0x8d
     [c014289c] __lock_acquire+0x4ba/0xc07
     [c0109ef2] save_stack_trace+0x20/0x3a
     [c0142fa1] __lock_acquire+0xbbf/0xc07
     [c0412452] ipv6_dev_mc_inc+0x24d/0x31c
     [c0143062] lock_acquire+0x79/0x93
     [c04120d6] igmp6_group_added+0x18/0x11d
     [c0439d62] _spin_lock_bh+0x3b/0x64
     [c04120d6] igmp6_group_added+0x18/0x11d
     [c04120d6] igmp6_group_added+0x18/0x11d
     [c0141f9f] trace_hardirqs_on+0x122/0x14c
     [c04124a8] ipv6_dev_mc_inc+0x2a3/0x31c
     [c0412452] ipv6_dev_mc_inc+0x24d/0x31c
     [c04124dd] ipv6_dev_mc_inc+0x2d8/0x31c
     [c0412205] ipv6_dev_mc_inc+0x0/0x31c
     [c0401834] ipv6_add_dev+0x21c/0x24b
     [c040b07d] ndisc_ifinfo_sysctl_change+0x0/0x1ef
     [c05c5b40] addrconf_init+0x13/0x193
     [c0199f63] proc_net_fops_create+0x10/0x21
Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)
On Fri, 14 Dec 2007, Andy Gospodarek wrote:
On Fri, Dec 14, 2007 at 07:57:42PM +0100, Krzysztof Oledzki wrote:
On Fri, 14 Dec 2007, Andy Gospodarek wrote:
On Fri, Dec 14, 2007 at 05:14:57PM +0100, Krzysztof Oledzki wrote:
On Wed, 12 Dec 2007, Jay Vosburgh wrote:
Herbert Xu [EMAIL PROTECTED] wrote:

diff -puN drivers/net/bonding/bond_sysfs.c~bonding-locking-fix drivers/net/bonding/bond_sysfs.c
--- a/drivers/net/bonding/bond_sysfs.c~bonding-locking-fix
+++ a/drivers/net/bonding/bond_sysfs.c
@@ -,8 +,6 @@ static ssize_t bonding_store_primary(str
 out:
 	write_unlock_bh(&bond->lock);
-	rtnl_unlock();
-

Looking at the changeset that added this perhaps the intention is to hold the lock? If so we should add an rtnl_lock to the start of the function.

Yes, this function needs to hold locks, and more than just what's there now. I believe the following should be correct; I haven't tested it, though (I'm supposedly on vacation right now).

The following change should be correct for the bonding_store_primary case discussed in this thread, and also corrects the bonding_store_active case which performs similar functions.

The bond_change_active_slave and bond_select_active_slave functions both require rtnl, bond->lock for read and curr_slave_lock for write_bh, and no other locks. This is so that the lower level mode-specific functions can release locks down to just rtnl in order to call, e.g., dev_set_mac_address with the locks it expects (rtnl only).

Signed-off-by: Jay Vosburgh [EMAIL PROTECTED]

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 11b76b3..28a2d80 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d,
 	struct slave *slave;
 	struct bonding *bond = to_bond(d);
 
-	write_lock_bh(&bond->lock);
+	rtnl_lock();
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
+
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
 		       ": %s: Unable to set primary slave; %s is in mode %d\n",
@@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d,
 		}
 	}
 out:
-	write_unlock_bh(&bond->lock);
-
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
 	rtnl_unlock();
 
 	return count;
@@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct device *d,
 	struct bonding *bond = to_bond(d);
 
 	rtnl_lock();
-	write_lock_bh(&bond->lock);
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
 
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
@@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct device *d,
 		}
 	}
 out:
-	write_unlock_bh(&bond->lock);
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
 	rtnl_unlock();
 
 	return count;

Vanilla 2.6.24-rc5 plus this patch:

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.24-rc5 #1
---------------------------------------------------------
events/0/9 just changed the state of lock:
 (&mc->mca_lock){-+..}, at: [c0411c7a] mld_ifc_timer_expire+0x130/0x1fb
but this lock took another, soft-read-irq-unsafe lock in the past:
 (&bond->lock){-.--}

and interrupts could create inverse lock ordering between them.

Grrr, I should have seen that -- sorry. Try your luck with this instead:

CUT

No luck.

bonding: bond0: setting mode to active-backup (1).
bonding: bond0: Setting MII monitoring interval to 100.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: Adding slave eth0.
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: first active interface up!
bonding: bond0: enslaving eth0 as an active interface with an up link.
bonding: bond0: Adding slave eth1.
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready

SNIP

bonding: bond0: enslaving eth1 as a backup interface with a down link.
bonding: bond0: Setting eth0 as primary slave.
bond0: no IPv6 routers present

Based on the console log, I'm guessing your initialization scripts use sysfs to set eth0 as the primary interface for bond0? Can you confirm?

Yep, that's correct:

postup() {
	if [[ ${IFACE} == bond0 ]] ; then
		echo -n +eth0 > /sys/class/net/${IFACE}/bonding/slaves
		echo -n +eth1 > /sys/class/net/${IFACE}/bonding/slaves
		echo -n eth0 > /sys/class/net/${IFACE}/bonding/primary
	fi
}

If you did somehow use sysfs to set the primary device as eth0, I'm guessing you never see this issue without that line or without this patch
Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)
On Tue, 11 Dec 2007, Andrew Morton wrote: On Tue, 11 Dec 2007 03:20:48 -0800 (PST) [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=9543 Summary: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055) Product: Drivers Version: 2.5 KernelVersion: 2.6.24-rc4-git7 Platform: All OS/Version: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: Network AssignedTo: [EMAIL PROTECTED] ReportedBy: [EMAIL PROTECTED] Most recent kernel where this bug did not occur: 2.6.23 Distribution: Gentoo Problem Description: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready RTNL: assertion failed at net/ipv6/addrconf.c (2164) Pid: 9, comm: events/0 Not tainted 2.6.24-rc4-git7 #1 [78402cfb] addrconf_notify+0x5b4/0x7b7 [7812203a] finish_task_switch+0x0/0x8c [781346ff] worker_thread+0x0/0x85 [78438e23] schedule+0x545/0x55f [781408d1] print_lock_contention_bug+0x11/0xd2 [783bfa72] rt_run_flush+0x43/0x8b [783bfa93] rt_run_flush+0x64/0x8b [7813ac54] notifier_call_chain+0x2a/0x52 [7813ac9e] raw_notifier_call_chain+0x17/0x1a [783a3471] netdev_state_change+0x18/0x29 [783ac6a9] __linkwatch_run_queue+0x150/0x17e [783ac6f4] linkwatch_event+0x1d/0x22 [78133cdf] run_workqueue+0xdb/0x1b6 [78133c8b] run_workqueue+0x87/0x1b6 [783ac6d7] linkwatch_event+0x0/0x22 [781346ff] worker_thread+0x0/0x85 [78134778] worker_thread+0x79/0x85 [781371ad] autoremove_wake_function+0x0/0x35 [781370f6] kthread+0x38/0x5e [781370be] kthread+0x0/0x5e [78104baf] kernel_thread_helper+0x7/0x10 === RTNL: assertion failed at net/ipv6/addrconf.c (1610) Hopefully this is due to the bug you reported in bug #9542. Does this patch fix both issues? Unfortunately not. I just updated bugzilla. Best regards, Krzysztof Olędzki
Re: [PATCH 5/6] e1000: Secondary unicast address support
On Tue, 13 Nov 2007, Auke Kok wrote: From: Patrick McHardy [EMAIL PROTECTED] Add support for configuring secondary unicast addresses. Unicast addresses take precedence over multicast addresses when filling the exact address filters to avoid going to promiscuous mode. When more unicast addresses are present than filter slots, unicast filtering is disabled and all slots can be used for multicast addresses. Is there any easy way to use it for VRRP? It would be really great to have two IP addresses on the same interface, each with a different hw address. Best regards, Krzysztof Olędzki
Re: [PATCH 5/6] e1000: Secondary unicast address support
On Tue, 13 Nov 2007, Ben Greear wrote: Krzysztof Oledzki wrote: On Tue, 13 Nov 2007, Auke Kok wrote: From: Patrick McHardy [EMAIL PROTECTED] Add support for configuring secondary unicast addresses. Unicast addresses take precedence over multicast addresses when filling the exact address filters to avoid going to promiscuous mode. When more unicast addresses are present than filter slots, unicast filtering is disabled and all slots can be used for multicast addresses. Is there any easy way to use it for VRRP? It would be really great to have two IP addresses on the same interface, each with a different hw address. mac-vlans should do this for you, with or without the driver patch. I'm afraid mac-vlans are not a solution here. Having 2x more interfaces (e.g. 2000 instead of 1000) makes everything (especially routing, firewalling and QoS) much more complicated. It would be nice to have something like ip addr add a.b.c.d/24 dev vlan32 hwaddress aa:bb:cc:dd:ee:ff. BTW: is it possible to stack mac-vlans on top of .1Q vlans? Best regards, Krzysztof Olędzki
Re: [PATCH 5/6] e1000: Secondary unicast address support
On Tue, 13 Nov 2007, Ben Greear wrote: Krzysztof Oledzki wrote: I'm afraid mac-vlans are not a solution here. Having 2x more interfaces (e.g. 2000 instead of 1000) makes everything (especially routing, firewalling and QoS) much more complicated. It would be nice to have something like ip addr add a.b.c.d/24 dev vlan32 hwaddress aa:bb:cc:dd:ee:ff. I'll take your word for it, though I have had good luck using mac-vlans in my own app. They are nice because they are full-fledged interfaces, so you can treat them basically as .1q vlans or ethernet devices, including all the routing and firewalling tricks. OK. But in my situation it is going to be: vlan1 (.1q) - real MAC vlan1a (mac-vlan) - VRRP MAC (...) vlan999 (.1q) - real MAC vlan999 (mac-vlan) - VRRP MAC ... with packets for the same destination coming in and out over both interfaces depending on the src ip address. BTW: is it possible to stack mac-vlans on top of .1Q vlans? I believe it will work fine. You could probably also stack .1q VLANs on top of mac-vlans so long as you use the same MAC for the VLANs as for the mac-vlan dev. So, this is exactly what I don't want to do, as I need two different MAC addresses. ;) Best regards, Krzysztof Olędzki
ISNs and 2.6.22, Was: Re: haproxy linux firewall (netfilter)
On Sat, 20 Oct 2007, Willy Tarreau wrote: CUT What is very strange is that linux uses random increments, so your ISNs should not wrap in a matter of a few seconds. Good point. I need to investigate this. netcat is very convenient for such tests. It's easy to bind it to a source port for consecutive tests while you run tcpdump in the background : $ echo bla | nc -p 1234 192.168.1.2 80 $ echo bla | nc -p 1234 192.168.1.2 80 Also, please try this with tcp_timestamps enabled and disabled to see if it changes anything. Interesting... :| 2.6.20: 18:52:33.558379 IP 192.168.0.33. 212.77.100.101.80: S 3708509816:3708509816(0) win 5840 mss 1460,sackOK,timestamp 1884090256 0,nop,wscale 1 18:52:33.882129 IP 192.168.0.33. 212.77.100.101.80: S 3708833567:3708833567(0) win 5840 mss 1460,sackOK,timestamp 1884090580 0,nop,wscale 1 18:52:34.084000 IP 192.168.0.33. 212.77.100.101.80: S 3709035437:3709035437(0) win 5840 mss 1460,sackOK,timestamp 1884090782 0,nop,wscale 1 2.6.21: 18:58:36.074969 IP 192.168.0.66. 212.77.100.101.80: S 110585153:110585153(0) win 5840 mss 1460,sackOK,timestamp 112007046 0,nop,wscale 5 18:58:36.440084 IP 192.168.0.66. 212.77.100.101.80: S 110950271:110950271(0) win 5840 mss 1460,sackOK,timestamp 112007412 0,nop,wscale 5 18:58:36.830141 IP 192.168.0.66. 212.77.100.101.80: S 111340328:111340328(0) win 5840 mss 1460,sackOK,timestamp 112007802 0,nop,wscale 5 2.6.22: 18:59:34.525097 IP 192.168.0.7. 212.77.100.101.80: S 3303295586:3303295586(0) win 5840 mss 1460,sackOK,timestamp 842 0,nop,wscale 6 18:59:34.942104 IP 192.168.0.7. 212.77.100.101.80: S 3720303240:3720303240(0) win 5840 mss 1460,sackOK,timestamp 1112259 0,nop,wscale 6 18:59:35.412229 IP 192.168.0.7. 212.77.100.101.80: S 4190427367:4190427367(0) win 5840 mss 1460,sackOK,timestamp 1112729 0,nop,wscale 6 2.6.22+tcp_timestamps=0: 19:00:38.285554 IP 192.168.0.7. 212.77.100.101.80: S 2639244549:2639244549(0) win 5840 mss 1460,nop,nop,sackOK,nop,wscale 6 19:00:39.448675 IP 192.168.0.7. 
212.77.100.101.80: S 3802363348:3802363348(0) win 5840 mss 1460,nop,nop,sackOK,nop,wscale 6 19:00:43.003850 IP 192.168.0.7. 212.77.100.101.80: S 3062574559:3062574559(0) win 5840 mss 1460,nop,nop,sackOK,nop,wscale 6 19:00:45.950863 IP 192.168.0.7. 212.77.100.101.80: S 1714619373:1714619373(0) win 5840 mss 1460,nop,nop,sackOK,nop,wscale 6 So it seems that ISNs are not randomly incremented but rather randomly generated. Adding netdev@vger.kernel.org to the CC list. Best regards, Krzysztof Olędzki
Re: ISNs and 2.6.22, Was: Re: haproxy linux firewall (netfilter)
On Sat, 20 Oct 2007, Krzysztof Oledzki wrote: On Sat, 20 Oct 2007, Willy Tarreau wrote: CUT What is very strange is that linux uses random increments, so your ISNs should not wrap in a matter of a few seconds. Good point. I need to investigate this. netcat is very convenient for such tests. It's easy to bind it to a source port for consecutive tests while you run tcpdump in the background : $ echo bla | nc -p 1234 192.168.1.2 80 $ echo bla | nc -p 1234 192.168.1.2 80 Also, please try this with tcp_timestamps enabled and disabled to see if it changes anything. Interesting... :| 2.6.20: 18:52:33.558379 IP 192.168.0.33. 212.77.100.101.80: S 3708509816:3708509816(0) win 5840 mss 1460,sackOK,timestamp 1884090256 0,nop,wscale 1 18:52:33.882129 IP 192.168.0.33. 212.77.100.101.80: S 3708833567:3708833567(0) win 5840 mss 1460,sackOK,timestamp 1884090580 0,nop,wscale 1 18:52:34.084000 IP 192.168.0.33. 212.77.100.101.80: S 3709035437:3709035437(0) win 5840 mss 1460,sackOK,timestamp 1884090782 0,nop,wscale 1 2.6.21: 18:58:36.074969 IP 192.168.0.66. 212.77.100.101.80: S 110585153:110585153(0) win 5840 mss 1460,sackOK,timestamp 112007046 0,nop,wscale 5 18:58:36.440084 IP 192.168.0.66. 212.77.100.101.80: S 110950271:110950271(0) win 5840 mss 1460,sackOK,timestamp 112007412 0,nop,wscale 5 18:58:36.830141 IP 192.168.0.66. 212.77.100.101.80: S 111340328:111340328(0) win 5840 mss 1460,sackOK,timestamp 112007802 0,nop,wscale 5 2.6.22: 18:59:34.525097 IP 192.168.0.7. 212.77.100.101.80: S 3303295586:3303295586(0) win 5840 mss 1460,sackOK,timestamp 842 0,nop,wscale 6 18:59:34.942104 IP 192.168.0.7. 212.77.100.101.80: S 3720303240:3720303240(0) win 5840 mss 1460,sackOK,timestamp 1112259 0,nop,wscale 6 18:59:35.412229 IP 192.168.0.7. 212.77.100.101.80: S 4190427367:4190427367(0) win 5840 mss 1460,sackOK,timestamp 1112729 0,nop,wscale 6 2.6.22+tcp_timestamps=0: 19:00:38.285554 IP 192.168.0.7. 
212.77.100.101.80: S 2639244549:2639244549(0) win 5840 mss 1460,nop,nop,sackOK,nop,wscale 6 19:00:39.448675 IP 192.168.0.7. 212.77.100.101.80: S 3802363348:3802363348(0) win 5840 mss 1460,nop,nop,sackOK,nop,wscale 6 19:00:43.003850 IP 192.168.0.7. 212.77.100.101.80: S 3062574559:3062574559(0) win 5840 mss 1460,nop,nop,sackOK,nop,wscale 6 19:00:45.950863 IP 192.168.0.7. 212.77.100.101.80: S 1714619373:1714619373(0) win 5840 mss 1460,nop,nop,sackOK,nop,wscale 6 So it seems that ISNs are not randomly incremented but rather randomly generated. Adding netdev@vger.kernel.org to the CC list. Eh, I was in too much of a hurry this time. They were not randomly generated but incremented by too big a value. This patch fixes my problem: http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob;f=queue-2.6.22/fix-tcp-initial-sequence-number-selection.patch;h=05b9167d68ecde1e6088f58c55e2906b768420ed;hb=HEAD Looking forward to the next -stable release. ;) Best regards, Krzysztof Olędzki
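The "incremented by too big a value" observation can be checked with plain arithmetic on the SYN captures quoted in this thread. The sketch below is only that: it takes the first and last tcpdump samples for 2.6.20 and 2.6.22 (timestamps converted to microseconds within the capture minute) and estimates how fast the ISN advances. The function and variable names are made up for illustration, and nothing here models the kernel's actual ISN code.

```python
# Arithmetic on the quoted SYN captures only; not a model of the kernel code.
ISN_SPACE = 2 ** 32  # ISNs are 32-bit and wrap around

# (timestamp in microseconds within the capture minute, ISN) per kernel
captures = {
    "2.6.20": [(33_558_379, 3708509816),
               (33_882_129, 3708833567),
               (34_084_000, 3709035437)],
    "2.6.22": [(34_525_097, 3303295586),
               (34_942_104, 3720303240),
               (35_412_229, 4190427367)],
}


def isn_rate_per_us(samples):
    """Average ISN increment per microsecond between first and last sample."""
    (t_first, isn_first), (t_last, isn_last) = samples[0], samples[-1]
    delta_isn = (isn_last - isn_first) % ISN_SPACE  # tolerate wrap-around
    return delta_isn / (t_last - t_first)


def wrap_seconds(rate_per_us):
    """Seconds needed to walk the whole 32-bit ISN space at the given rate."""
    return ISN_SPACE / rate_per_us / 1_000_000


for kernel, samples in captures.items():
    rate = isn_rate_per_us(samples)
    print(f"{kernel}: ~{rate:.1f} ISN units/us, "
          f"wraps in ~{wrap_seconds(rate):.1f} s")
```

On these samples 2.6.20 advances about one unit per microsecond (a wrap takes over an hour), while 2.6.22 advances about a thousand units per microsecond, so the ISN space wraps in a few seconds. That matches the symptom discussed at the start of the thread.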
Re: TCP port randomization
On Wed, 17 Oct 2007, Stephen Hemminger wrote: On Thu, 18 Oct 2007 00:31:13 +0200 (CEST) Krzysztof Oledzki [EMAIL PROTECTED] wrote: On Wed, 17 Oct 2007, Stephen Hemminger wrote: On Wed, 17 Oct 2007 23:15:48 +0200 (CEST) Krzysztof Oledzki [EMAIL PROTECTED] wrote: Hello, Is it normal that TCP port randomization (tested with 2.6.22) works only when explicitly binding to a IP address: --- cut here --- [EMAIL PROTECTED]:~# nc 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused [EMAIL PROTECTED]:~# nc 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused [EMAIL PROTECTED]:~# nc 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused 23:11:11.896126 IP 192.168.129.2.37839 192.168.129.28.11: S 23:11:12.146573 IP 192.168.129.2.37840 192.168.129.28.11: S 23:11:12.396488 IP 192.168.129.2.37841 192.168.129.28.11: S --- cut here --- --- cut here --- [EMAIL PROTECTED]:~# nc -s 192.168.129.2 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused [EMAIL PROTECTED]:~# nc -s 192.168.129.2 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused [EMAIL PROTECTED]:~# nc -s 192.168.129.2 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused 23:11:31.704391 IP 192.168.129.2.57204 192.168.129.28.11: S 23:11:34.400048 IP 192.168.129.2.14512 192.168.129.28.11: S 23:11:34.606707 IP 192.168.129.2.20117 192.168.129.28.11: S --- cut here --- Best regards, Krzysztof Olędzki It is a expected side effect. So it is not possible to use randomization without binding to a specific srcip? The starting point for the search is based on hash(srcaddr, dstaddr, dstport, secret). You are using same source, dest and port so yes it will stay the same until rekeying occurs. The secret only changes every 5min same as TCP initial sequence number. If I get it right, even with explicitly selected constant srcaddr port numbers should simply increase? This is not what I observed. 
When you set srcaddr, it calls bind, and bind does randomization always independent of address. This existing behavior may seem odd, but it shouldn't present a security problem. Right. Thank you very much for the explanation. Best regards, Krzysztof Olędzki
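The two behaviours described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel implementation: the port range, the SHA-256 stand-in for the keyed hash, and every name below are hypothetical. The only parts taken from the explanation are that connect() without a bind starts its search at hash(srcaddr, dstaddr, dstport, secret) and walks forward, while an explicit bind picks a random starting point each time.

```python
import hashlib
import random

LOW, HIGH = 32768, 61000  # assumed ephemeral port range, illustration only
SPAN = HIGH - LOW


def tuple_offset(src, dst, dport, secret):
    """Stand-in for hash(srcaddr, dstaddr, dstport, secret)."""
    digest = hashlib.sha256(f"{src}|{dst}|{dport}".encode() + secret).digest()
    return int.from_bytes(digest[:4], "big")


class ConnectPortPicker:
    """Implicit source port on connect(): hashed start, then step forward."""

    def __init__(self, secret):
        self.secret = secret
        self.hint = 0        # advances on every successful pick
        self.in_use = set()

    def pick(self, src, dst, dport):
        offset = self.hint + tuple_offset(src, dst, dport, self.secret)
        for i in range(1, SPAN + 1):
            port = LOW + (offset + i) % SPAN
            if port not in self.in_use:
                self.in_use.add(port)
                self.hint += i
                return port
        raise OSError("no free port in range")


def bind_port(rng, in_use):
    """Explicit bind: the search starts at a random point on every call."""
    start = rng.randrange(SPAN)
    for i in range(SPAN):
        port = LOW + (start + i) % SPAN
        if port not in in_use:
            in_use.add(port)
            return port
    raise OSError("no free port in range")


picker = ConnectPortPicker(secret=b"example-secret")
sequential = [picker.pick("192.168.129.2", "192.168.129.28", 11) for _ in range(3)]
scattered = [bind_port(random.Random(seed), set()) for seed in range(3)]
print("connect without bind:", sequential)
print("explicit bind:       ", scattered)
```

With the same source, destination and port, the first list steps by one on each call until the secret changes, which is the 37839, 37840, 37841 pattern in the capture; the second list jumps around like the `nc -s` capture.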
Re: BUG: unable to handle kernel NULL pointer dereference at virtual address 000000b0
On Wed, 17 Oct 2007, Eric Dumazet wrote: Krzysztof Oledzki a écrit : On Wed, 17 Oct 2007, Eric Dumazet wrote: Krzysztof Oledzki a écrit : On Wed, 17 Oct 2007, Eric Dumazet wrote: Krzysztof Oledzki a écrit : Hello, Today I found in my logs: BUG: unable to handle kernel NULL pointer dereference at virtual address 00b0 printing eip: 78395f65 *pde = Oops: [#1] PREEMPT SMP CPU:0 EIP:0060:[78395f65]Not tainted VLI EFLAGS: 00210286 (2.6.22.9 #1) EIP is at __ip_route_output_key+0x412/0x722 eax: 8000 ebx: ecx: 5dd2b1c3 edx: esi: edi: d44c7e30 ebp: ec8c4980 esp: d44c7ddc ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068 Process smtpd (pid: 12479, ti=d44c6000 task=9e759510 task.ti=d44c6000) Stack: d44c7e7c d44c7e7c d44c7eb8 d44c7e7c 0005 5dd2b1c3 0003 d44c7e7c Call Trace: [78396280] ip_route_output_flow+0xb/0x3e [783b2b29] ip4_datagram_connect+0x1c9/0x308 [783ba70a] inet_dgram_connect+0x45/0x4e [7837135e] sys_connect+0x72/0x9c [78371607] sock_map_fd+0x41/0x4a [7840d1b1] _spin_lock+0x33/0x3e [7840d623] _spin_unlock+0x25/0x3b [78371607] sock_map_fd+0x41/0x4a [78372792] sys_socketcall+0x8f/0x242 [7813e99c] trace_hardirqs_on+0x122/0x14c [78103dc6] sysenter_past_esp+0x8f/0x99 [78103d96] sysenter_past_esp+0x5f/0x99 === Code: fa e0 00 00 00 75 07 c6 44 24 56 05 eb 14 81 fa f0 00 00 00 0f 84 e1 02 00 00 84 c0 0f 84 d9 02 00 00 8b 44 24 0c 0d 00 00 00 80 f6 86 b0 00 00 00 08 0f 44 44 24 0c 89 44 24 0c b8 01 00 00 00 EIP: [78395f65] __ip_route_output_key+0x412/0x722 SS:ESP 0068:d44c7ddc Shortly before it there was: Oct 17 07:17:55 cougar postfix/master[3400]: warning: process /usr/lib/postfix/smtpd pid 12479 killed by signal 11 Best regards, Krzysztof Olędzki Hello Krzysztof Could you give us some details about this ? kernel version at least. Yes, I was little to hurry sending this bug report. 
Anyway, it is 2.6.22.9, as mentioned in the oops: EFLAGS: 00210286 (2.6.22.9 #1) (you could for example take a look at REPORTING-BUGS, or run scripts/ver_linux) Linux cougar 2.6.22.9 #1 SMP PREEMPT Wed Oct 3 10:24:19 CEST 2007 i686 Intel(R) Pentium(R) D CPU 3.20GHz GenuineIntel GNU/Linux Gnu C 4.1.2 Gnu make 3.81 binutils 2.17 util-linux 2.12r mount 2.12r module-init-tools 3.2.2 e2fsprogs 1.40.2 Linux C Library libc.2.5 Dynamic linker (ldd) 2.5 Procps 3.2.7 Net-tools 1.60 Kbd 1.12 Sh-utils 6.9 Yes indeed, the version was in your initial report. It seems this kernel is unusual (VMSPLIT_2G_OPT instead of standard VMSPLIT_3G), any chance you could provide the full .config? Attached, both .config and dmesg. Hum, you are using the IPT_TPROXY thing, which is not in linux-2.6.22.9 It is only compiled in, not used at the moment. I have no idea how this can taint the kernel, since you provide no information. Try to reproduce the problem with a genuine kernel. OK. Thank you. Best regards, Krzysztof Olędzki
Re: BUG: unable to handle kernel NULL pointer dereference at virtual address 000000b0
On Thu, 18 Oct 2007, Patrick McHardy wrote: Krzysztof Oledzki wrote: Hum, you are using IPT_TPROXY thing, which is not in linux-2.6.22.9 It is only compiled in, not used at the moment. But at least the previous version (before those patches posted a week ago) touches the routing code in exactly that function. Right. Thank you. Best regards, Krzysztof Olędzki
BUG: unable to handle kernel NULL pointer dereference at virtual address 000000b0
Hello, Today I found in my logs: BUG: unable to handle kernel NULL pointer dereference at virtual address 00b0 printing eip: 78395f65 *pde = Oops: [#1] PREEMPT SMP CPU:0 EIP:0060:[78395f65]Not tainted VLI EFLAGS: 00210286 (2.6.22.9 #1) EIP is at __ip_route_output_key+0x412/0x722 eax: 8000 ebx: ecx: 5dd2b1c3 edx: esi: edi: d44c7e30 ebp: ec8c4980 esp: d44c7ddc ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068 Process smtpd (pid: 12479, ti=d44c6000 task=9e759510 task.ti=d44c6000) Stack: d44c7e7c d44c7e7c d44c7eb8 d44c7e7c 0005 5dd2b1c3 0003 d44c7e7c Call Trace: [78396280] ip_route_output_flow+0xb/0x3e [783b2b29] ip4_datagram_connect+0x1c9/0x308 [783ba70a] inet_dgram_connect+0x45/0x4e [7837135e] sys_connect+0x72/0x9c [78371607] sock_map_fd+0x41/0x4a [7840d1b1] _spin_lock+0x33/0x3e [7840d623] _spin_unlock+0x25/0x3b [78371607] sock_map_fd+0x41/0x4a [78372792] sys_socketcall+0x8f/0x242 [7813e99c] trace_hardirqs_on+0x122/0x14c [78103dc6] sysenter_past_esp+0x8f/0x99 [78103d96] sysenter_past_esp+0x5f/0x99 === Code: fa e0 00 00 00 75 07 c6 44 24 56 05 eb 14 81 fa f0 00 00 00 0f 84 e1 02 00 00 84 c0 0f 84 d9 02 00 00 8b 44 24 0c 0d 00 00 00 80 f6 86 b0 00 00 00 08 0f 44 44 24 0c 89 44 24 0c b8 01 00 00 00 EIP: [78395f65] __ip_route_output_key+0x412/0x722 SS:ESP 0068:d44c7ddc Shortly before it there was: Oct 17 07:17:55 cougar postfix/master[3400]: warning: process /usr/lib/postfix/smtpd pid 12479 killed by signal 11 Best regards, Krzysztof Olędzki
Re: BUG: unable to handle kernel NULL pointer dereference at virtual address 000000b0
On Wed, 17 Oct 2007, Eric Dumazet wrote: Krzysztof Oledzki a écrit : Hello, Today I found in my logs: BUG: unable to handle kernel NULL pointer dereference at virtual address 00b0 printing eip: 78395f65 *pde = Oops: [#1] PREEMPT SMP CPU:0 EIP:0060:[78395f65]Not tainted VLI EFLAGS: 00210286 (2.6.22.9 #1) EIP is at __ip_route_output_key+0x412/0x722 eax: 8000 ebx: ecx: 5dd2b1c3 edx: esi: edi: d44c7e30 ebp: ec8c4980 esp: d44c7ddc ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068 Process smtpd (pid: 12479, ti=d44c6000 task=9e759510 task.ti=d44c6000) Stack: d44c7e7c d44c7e7c d44c7eb8 d44c7e7c 0005 5dd2b1c3 0003 d44c7e7c Call Trace: [78396280] ip_route_output_flow+0xb/0x3e [783b2b29] ip4_datagram_connect+0x1c9/0x308 [783ba70a] inet_dgram_connect+0x45/0x4e [7837135e] sys_connect+0x72/0x9c [78371607] sock_map_fd+0x41/0x4a [7840d1b1] _spin_lock+0x33/0x3e [7840d623] _spin_unlock+0x25/0x3b [78371607] sock_map_fd+0x41/0x4a [78372792] sys_socketcall+0x8f/0x242 [7813e99c] trace_hardirqs_on+0x122/0x14c [78103dc6] sysenter_past_esp+0x8f/0x99 [78103d96] sysenter_past_esp+0x5f/0x99 === Code: fa e0 00 00 00 75 07 c6 44 24 56 05 eb 14 81 fa f0 00 00 00 0f 84 e1 02 00 00 84 c0 0f 84 d9 02 00 00 8b 44 24 0c 0d 00 00 00 80 f6 86 b0 00 00 00 08 0f 44 44 24 0c 89 44 24 0c b8 01 00 00 00 EIP: [78395f65] __ip_route_output_key+0x412/0x722 SS:ESP 0068:d44c7ddc Shortly before it there was: Oct 17 07:17:55 cougar postfix/master[3400]: warning: process /usr/lib/postfix/smtpd pid 12479 killed by signal 11 Best regards, Krzysztof Olędzki Hello Krzysztof Could you give us some details about this ? kernel version at least. Yes, I was little to hurry sending this bug report. 
Anyway, it is 2.6.22.9, as mentioned in the oops: EFLAGS: 00210286 (2.6.22.9 #1) (you could for example take a look at REPORTING-BUGS, or run scripts/ver_linux) Linux cougar 2.6.22.9 #1 SMP PREEMPT Wed Oct 3 10:24:19 CEST 2007 i686 Intel(R) Pentium(R) D CPU 3.20GHz GenuineIntel GNU/Linux Gnu C 4.1.2 Gnu make 3.81 binutils 2.17 util-linux 2.12r mount 2.12r module-init-tools 3.2.2 e2fsprogs 1.40.2 Linux C Library libc.2.5 Dynamic linker (ldd) 2.5 Procps 3.2.7 Net-tools 1.60 Kbd 1.12 Sh-utils 6.9 Best regards, Krzysztof Olędzki
TCP port randomization
Hello, Is it normal that TCP port randomization (tested with 2.6.22) works only when explicitly binding to a IP address: --- cut here --- [EMAIL PROTECTED]:~# nc 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused [EMAIL PROTECTED]:~# nc 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused [EMAIL PROTECTED]:~# nc 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused 23:11:11.896126 IP 192.168.129.2.37839 192.168.129.28.11: S 23:11:12.146573 IP 192.168.129.2.37840 192.168.129.28.11: S 23:11:12.396488 IP 192.168.129.2.37841 192.168.129.28.11: S --- cut here --- --- cut here --- [EMAIL PROTECTED]:~# nc -s 192.168.129.2 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused [EMAIL PROTECTED]:~# nc -s 192.168.129.2 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused [EMAIL PROTECTED]:~# nc -s 192.168.129.2 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused 23:11:31.704391 IP 192.168.129.2.57204 192.168.129.28.11: S 23:11:34.400048 IP 192.168.129.2.14512 192.168.129.28.11: S 23:11:34.606707 IP 192.168.129.2.20117 192.168.129.28.11: S --- cut here --- Best regards, Krzysztof Olędzki
Re: TCP port randomization
On Wed, 17 Oct 2007, Stephen Hemminger wrote: On Wed, 17 Oct 2007 23:15:48 +0200 (CEST) Krzysztof Oledzki [EMAIL PROTECTED] wrote: Hello, Is it normal that TCP port randomization (tested with 2.6.22) works only when explicitly binding to a IP address: --- cut here --- [EMAIL PROTECTED]:~# nc 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused [EMAIL PROTECTED]:~# nc 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused [EMAIL PROTECTED]:~# nc 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused 23:11:11.896126 IP 192.168.129.2.37839 192.168.129.28.11: S 23:11:12.146573 IP 192.168.129.2.37840 192.168.129.28.11: S 23:11:12.396488 IP 192.168.129.2.37841 192.168.129.28.11: S --- cut here --- --- cut here --- [EMAIL PROTECTED]:~# nc -s 192.168.129.2 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused [EMAIL PROTECTED]:~# nc -s 192.168.129.2 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused [EMAIL PROTECTED]:~# nc -s 192.168.129.2 192.168.129.28 11 (UNKNOWN) [192.168.129.28] 11 (systat) : Connection refused 23:11:31.704391 IP 192.168.129.2.57204 192.168.129.28.11: S 23:11:34.400048 IP 192.168.129.2.14512 192.168.129.28.11: S 23:11:34.606707 IP 192.168.129.2.20117 192.168.129.28.11: S --- cut here --- Best regards, Krzysztof Olędzki It is a expected side effect. So it is not possible to use randomization without binding to a specific srcip? The starting point for the search is based on hash(srcaddr, dstaddr, dstport, secret). You are using same source, dest and port so yes it will stay the same until rekeying occurs. The secret only changes every 5min same as TCP initial sequence number. If I get it right, even with explicitly selected constant srcaddr port numbers should simply increase? This is not what I observed. Thanks. Best regards, Krzysztof Olędzki
Re: incorrect cksum with tcp/udp on lo with 2.6.20/2.6.21/2.6.22
On Tue, 2 Oct 2007, Herbert Xu wrote: On Mon, Sep 24, 2007 at 11:44:19AM +0200, Krzysztof Oledzki wrote: So, with DR mode, packet goes by the lo device (with bad checksum) and then get redirected outside. Unfortunately, when it leaves host it has bad checksum, too. :( Did you check this by taking a tcpdump on an external host? Yes. Doing a local tcpdump doesn't work as tcpdump won't show the correct checksum if checksum offload is enabled. Indeed, I'm aware of this. If it's really sending a bogus checksum then it's a bug in LVS. I'm not sure if we should call it a bug. LVS does not support such a configuration by default - it requires kernel patching. However, it worked with older kernels, which is why I asked whether it is possible to force full TCP/UDP checksum calculation. Thank you. Best regards, Krzysztof Olędzki
Upgrading 2.6.21.7-2.6.22.9 kills my network (sky2): sky2 eth0: rx error, status 0x402300 length 60
Hello, After upgrading my kernel from 2.6.21.7 to 2.6.22.9 my 88E8053 no longer works: sky2 :02:00.0: v1.14 addr 0xcfffc000 irq 17 Yukon-EC (0xb6) rev 1 sky2 eth0: addr 00:11:d8:50:f6:28 sky2 eth0: enabling interface sky2 eth0: ram buffer 48K sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both sky2 eth0: rx error, status 0x402300 length 60 sky2 eth0: rx error, status 0x402500 length 60 sky2 eth0: rx error, status 0x402300 length 60 sky2 eth0: rx error, status 0x402500 length 60 sky2 eth0: rx error, status 0x402300 length 60 sky2 eth0: rx error, status 0x402500 length 60 sky2 eth0: rx error, status 0x402300 length 60 sky2 eth0: rx error, status 0x402300 length 60 sky2 eth0: rx error, status 0x402500 length 60 sky2 eth0: rx error, status 0x402300 length 60 sky2 eth0: rx error, status 0x402500 length 60 sky2 eth0: rx error, status 0x402500 length 60 sky2 eth0: rx error, status 0x402500 length 60 sky2 eth0: rx error, status 0x402500 length 60 (...) I also compared lspci output from both 2.6.21/2.6.22 and it is the same: 02:00.0 Ethernet controller [0200]: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller [11ab:4362] (rev 15) Subsystem: ASUSTeK Computer Inc. 
Marvell 88E8053 Gigabit Ethernet controller PCIe (Asus) [1043:8142] Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Latency: 0, Cache Line Size: 16 bytes Interrupt: pin A routed to IRQ 221 Region 0: Memory at cfffc000 (64-bit, non-prefetchable) [size=16K] Region 2: I/O ports at d800 [size=256] Expansion ROM at cffc [disabled] [size=128K] Capabilities: [48] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=1 PME- Capabilities: [50] Vital Product Data Capabilities: [5c] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable+ Address: fee0300c Data: 41c9 Capabilities: [e0] Express Legacy Endpoint IRQ 0 Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag- Device: Latency L0s unlimited, L1 unlimited Device: AtnBtn- AtnInd- PwrInd- Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported- Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr+ NoSnoop- Device: MaxPayload 128 bytes, MaxReadReq 512 bytes Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0 Link: Latency L0s 256ns, L1 unlimited Link: ASPM Disabled RCB 128 bytes CommClk- ExtSynch- Link: Speed 2.5Gb/s, Width x1 00: ab 11 62 43 07 04 10 00 15 00 00 02 04 00 00 00 10: 04 c0 ff cf 00 00 00 00 01 d8 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 42 81 30: 00 00 fc cf 48 00 00 00 00 00 00 00 0a 01 00 00 40: 00 00 f0 01 00 80 a0 01 01 50 02 fe 00 20 00 14 50: 03 5c 00 80 00 00 00 01 00 00 00 01 05 e0 83 00 60: 0c 30 e0 fe 00 00 00 00 c9 41 00 00 00 00 00 00 70: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0: 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e0: 10 00 11 00 c0 0f 00 00 00 24 1b 00 11 a4 03 00 f0: 08 00 11 10 00 00 00 00 00 00 00 00 00 00 00 00 It is quite strange as on the other similar system (only rev 1/2 difference), sky2 driver from this 2.6.22 kernel solved my problem (network hangs): sky2 :03:00.0: v1.14 addr 0xf100 irq 16 Yukon-EC (0xb6) rev 2 sky2 eth0: addr 00:16:e6:5f:64:24 sky2 eth0: enabling interface sky2 eth0: ram buffer 48K sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both Best regards, Krzysztof Olędzki
Re: Upgrading 2.6.21.7-2.6.22.9 kills my network (sky2): sky2 eth0: rx error, status 0x402300 length 60
On Fri, 28 Sep 2007, Krzysztof Oledzki wrote: On Fri, 28 Sep 2007, Krzysztof Oledzki wrote: Hello, After upgrading my kernel from 2.6.21.7 to 2.6.22.9 my 88E8053 no longer works: Small update: 2.6.22.9 with sky2.c/sky2.h from 2.4.22.4 works without any problems. Final update. Reverting this patch: http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.22.y.git;a=commitdiff_plain;h=8c07a8e30ba8a2e0831da4b134202598435f8358 solved my problem. I also found this one: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=d6532232cd3de79c852685823a9c52f723816d0a Could it go to a next -stable ASAP, please? It seems that 2.6.22.5-2.6.22.9 kernels have broken sky2 if used with vlans. :( Such regression in a -stable kernel isn't nice. :( Best regards, Krzysztof Olędzki
Re: Upgrading 2.6.21.7-2.6.22.9 kills my network (sky2): sky2 eth0: rx error, status 0x402300 length 60
On Fri, 28 Sep 2007, Krzysztof Oledzki wrote: Hello, After upgrading my kernel from 2.6.21.7 to 2.6.22.9 my 88E8053 no longer works: Small update: 2.6.22.9 with sky2.c/sky2.h from 2.4.22.4 works without any problems. Best regards, Krzysztof Olędzki
Re: [stable] Upgrading 2.6.21.7-2.6.22.9 kills my network (sky2): sky2 eth0: rx error, status 0x402300 length 60
On Fri, 28 Sep 2007, Greg KH wrote: On Fri, Sep 28, 2007 at 01:11:27PM +0200, Krzysztof Oledzki wrote: On Fri, 28 Sep 2007, Krzysztof Oledzki wrote: On Fri, 28 Sep 2007, Krzysztof Oledzki wrote: Hello, After upgrading my kernel from 2.6.21.7 to 2.6.22.9 my 88E8053 no longer works: Small update: 2.6.22.9 with sky2.c/sky2.h from 2.4.22.4 works without any problems. Final update. Reverting this patch: http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.22.y.git;a=commitdiff_plain;h=8c07a8e30ba8a2e0831da4b134202598435f8358 solved my problem. I also found this one: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=d6532232cd3de79c852685823a9c52f723816d0a Could it go to a next -stable ASAP, please? It seems that 2.6.22.5-2.6.22.9 kernels have broken sky2 if used with vlans. :( Such regression in a -stable kernel isn't nice. :( So should we just apply the second patch? I'll let Stephen tell us what we should do :) The second patch works for me, so IMHO yes. I forgot to mention that earlier, sorry. Of course this should be the maintainer's decision; this is only my vote. :) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] sky2: fix VLAN receive processing
On Fri, 28 Sep 2007, Stephen Hemminger wrote: The length check for truncated frames was not correctly handling the case where VLAN acceleration had already read the tag. Also, the Yukon EX has some features that use high bit of status as security tag. Thank you. Best regards Krzysztof Oledzki
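One way to read the "rx error, status 0x402300 length 60" lines from the original report in light of this description is sketched below. It assumes, without having checked the driver source, that the upper 16 bits of the status word carry the frame length as seen by the MAC; the helper names are hypothetical. Under that assumption both logged status values report 64 bytes against a delivered length of 60, and the gap is exactly the 4 bytes of an 802.1Q tag.

```python
# Assumption (not verified against sky2.c): bits 31..16 of the receive
# status word hold the frame length counted by the MAC.
VLAN_HLEN = 4  # an 802.1Q tag is 4 bytes


def status_frame_length(status):
    """Frame length field under the assumed status-word layout."""
    return (status >> 16) & 0xFFFF


def length_gap(status, reported_length):
    """Difference between the status length and the length in the log line."""
    return status_frame_length(status) - reported_length


# Values copied from the "rx error" log lines in the report.
for status in (0x402300, 0x402500):
    print(hex(status), status_frame_length(status), length_gap(status, 60))
```

If the layout assumption holds, a strict "status length must equal received length" check would reject every tagged frame once the hardware has already stripped the tag, which is consistent with the failure showing up only with vlans.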
Re: incorrect cksum with tcp/udp on lo with 2.6.20/2.6.21/2.6.22
On Mon, 24 Sep 2007, Herbert Xu wrote: On Sun, Sep 23, 2007 at 11:18:58PM +0200, Krzysztof Oledzki wrote: Thank you for the information. Is there any easy way to turn them on? I need it for LVS. Do you really need it? Yes. I would like to use an LVS redirector as both a client and a director: http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.LVS-DR.html#director_as_client_in_LVS-DR The packets should be checksummed at the point where they physically leave the host. So, with DR mode, the packet goes through the lo device (with a bad checksum) and then gets redirected outside. Unfortunately, when it leaves the host it still has a bad checksum. :( Best regards, Krzysztof Olędzki
Re: incorrect cksum with tcp/udp on lo with 2.6.20/2.6.21/2.6.22
On Sun, 23 Sep 2007, Herbert Xu wrote: Krzysztof Oledzki [EMAIL PROTECTED] wrote: It seems that after some not very recent changes udp and tcp packets carrying data sent over loopback have an incorrect cksum: This is correct. The loopback interface has the no checksum flag set so we only provide a partial checksum on output (i.e., the pseudoheader without the payload). We even export this to user-space via a flag. So you should fix tcpdump to read this flag and ignore the checksum. Thank you for the information. Is there any easy way to turn them on? I need it for LVS. Best regards, Krzysztof Olędzki
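The "pseudoheader without the payload" remark can be checked against the capture from the original report. The sketch below sums only the TCP pseudo-header for the data-carrying segment tcpdump flagged (127.0.0.1 to 127.0.0.1, IP total length 57) and compares it with the value tcpdump printed. It assumes a 20-byte IP header with no options; the helper names are made up for illustration, and `full_checksum` is the ordinary Internet checksum shown for contrast.

```python
import socket
import struct


def ones_complement_sum(data):
    """Folded 16-bit ones' complement sum, as used by the Internet checksum."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src, dst, proto, l4_length):
    """IPv4 pseudo-header: source, destination, zero, protocol, L4 length."""
    return (socket.inet_aton(src) + socket.inet_aton(dst)
            + struct.pack("!BBH", 0, proto, l4_length))


def full_checksum(src, dst, proto, segment):
    """Complete checksum over pseudo-header plus segment (checksum field zeroed)."""
    data = pseudo_header(src, dst, proto, len(segment)) + segment
    return ~ones_complement_sum(data) & 0xFFFF


# The flagged TCP segment: IP total length 57, assumed 20-byte IP header.
tcp_length = 57 - 20
partial = ones_complement_sum(
    pseudo_header("127.0.0.1", "127.0.0.1", socket.IPPROTO_TCP, tcp_length))
print(hex(partial))  # 0xfe2d, the "incorrect" cksum tcpdump reported
```

The pseudo-header sum alone comes out as 0xfe2d, the exact value tcpdump marked as incorrect, which supports the explanation that the checksum field on loopback holds only the partial sum and the payload is never folded in.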
Re: [PATCH 1/2] bnx2: factor out gzip unpacker
On Fri, 21 Sep 2007, Denys Vlasenko wrote:
> On Friday 21 September 2007 19:36, [EMAIL PROTECTED] wrote:
>> On Fri, 21 Sep 2007 19:05:23 BST, Denys Vlasenko said:
>>> I plan to use gzip compression on following drivers' firmware, if patches will be accepted:
>>>
>>>    text    data     bss     dec     hex filename
>>>   17653  109968     240  127861   1f375 drivers/net/acenic.o
>>>    6628  120448       4  127080   1f068 drivers/net/dgrs.o
>> ^^
>> Should this be redone to use the existing firmware loading framework to load the firmware instead?
> Not in every case. For example, bnx2 maintainer says that driver and firmware are closely tied for his driver. IOW: you upgrade kernel and your NIC is not working anymore.

Firmware may come with a kernel. We already have an install step for modules; we can also add one for firmware.

> Another argument is to make kernel be able to bring up NICs without needing firmware images in initramfs/initrd/hard drive.

It is not possible to bring up things like FC or WiFi without firmware, so what is so special about classic NICs?

Best regards, Krzysztof Olędzki
incorrect cksum with tcp/udp on lo with 2.6.20/2.6.21/2.6.22
Hello,

It seems that after some not very recent changes, UDP and TCP packets carrying data sent over loopback have an incorrect cksum:

UDP:
# echo test|nc -u 127.0.0.1
# tcpdump -i lo -n -v -v port
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 96 bytes
19:43:39.340576 IP (tos 0x0, ttl 64, id 15179, offset 0, flags [DF], proto: UDP (17), length: 33) 127.0.0.1.49512 127.0.0.1.: [bad udp cksum 174c!] UDP, length 5

TCP:
# echo test|nc -u 127.0.0.1
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 96 bytes
*Correct:
19:44:27.692614 IP (tos 0x0, ttl 64, id 32100, offset 0, flags [DF], proto: TCP (6), length: 60) 127.0.0.1.53804 127.0.0.1.: S, cksum 0xfd54 (correct), 3426125135:3426125135(0) win 32792 mss 16396,sackOK,timestamp 1912797227 0,nop,wscale 7
19:44:27.692674 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: TCP (6), length: 60) 127.0.0.1. 127.0.0.1.53804: S, cksum 0xea3f (correct), 3427916955:3427916955(0) ack 3426125136 win 32768 mss 16396,sackOK,timestamp 1912797227 1912797227,nop,wscale 7
19:44:27.692711 IP (tos 0x0, ttl 64, id 32101, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.53804 127.0.0.1.: ., cksum 0xd263 (correct), 1:1(0) ack 1 win 257 nop,nop,timestamp 1912797227 1912797227
*Incorrect:
19:44:27.692831 IP (tos 0x0, ttl 64, id 32102, offset 0, flags [DF], proto: TCP (6), length: 57) 127.0.0.1.53804 127.0.0.1.: P, cksum 0xfe2d (incorrect (-> 0xe07c), 1:6(5) ack 1 win 257 nop,nop,timestamp 1912797227 1912797227
*Correct:
19:44:27.692859 IP (tos 0x0, ttl 64, id 9399, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1. 127.0.0.1.53804: ., cksum 0xd25f (correct), 1:1(0) ack 6 win 256 nop,nop,timestamp 1912797227 1912797227

Tested on:
- 2.6.22.6
- 2.6.21.7
- 2.6.20.11

Best regards, Krzysztof Olędzki
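Editorial note: the reply to this report explains that on loopback only a partial checksum (the pseudo-header, without the payload) is provided. One way to sanity-check that against the trace is to recompute the pseudo-header sum for the segment flagged as incorrect. The sketch below is an illustration only, not kernel or tcpdump code. The function names are made up for this note, and the TCP length of 37 bytes is an inference: the IP length of 57 from the trace minus an assumed 20-byte IP header without options.

```python
import socket
import struct


def fold16(data: bytes) -> int:
    """One's-complement sum of 16-bit big-endian words, folded to 16 bits."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src: str, dst: str, proto: int, l4_len: int) -> bytes:
    """IPv4 pseudo-header: source, destination, zero byte, protocol, L4 length."""
    return (socket.inet_aton(src) + socket.inet_aton(dst)
            + struct.pack("!BBH", 0, proto, l4_len))


def partial_checksum(src: str, dst: str, proto: int, l4_len: int) -> int:
    """Uncomplemented sum over the pseudo-header only (no header, no payload)."""
    return fold16(pseudo_header(src, dst, proto, l4_len))


def full_checksum(src: str, dst: str, proto: int, segment: bytes) -> int:
    """Complete checksum over pseudo-header plus segment.

    The checksum field inside `segment` must be zero when calling this.
    """
    data = pseudo_header(src, dst, proto, len(segment)) + segment
    return ~fold16(data) & 0xFFFF


if __name__ == "__main__":
    # TCP (protocol 6), 37-byte segment, both endpoints 127.0.0.1
    print(hex(partial_checksum("127.0.0.1", "127.0.0.1", 6, 37)))
```

Under those assumptions the script prints 0xfe2d, the same value tcpdump shows in the checksum field of the segment it marks as incorrect, which is consistent with the field holding only the pseudo-header sum.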
sky2: workaround for lost IRQ and 2.6.22-stable
Hello, http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.21.y.git;a=commitdiff;h=fe1fe7c982f86624c692644e8ed05e132f4753cc Is this fix going to be included in the next 2.6.22-stable release or is it not needed any more? Best regards, Krzysztof Olędzki
Re: Network card IRQ balancing with Intel 5000 series chipsets
On Wed, 27 Dec 2006, jamal wrote:
> On Wed, 2006-27-12 at 09:09 +0200, Robert Iakobashvili wrote:
>> My scenario is treatment of RTP packets in kernel space with a single network card (both Rx and Tx). The default of the Intel 5000 series chipset is affinity of each network card to a certain CPU. Currently I cannot find a way to balance that irq, neither with irqbalance nor with kernel irq-balancing (MSI and io-apic attempted).
> In the near future, when the NIC vendors wake up[1] because CPU vendors - including big bad Intel - are going to be putting out a large number of hardware threads, you should be able to do more clever things with such a setup. At the moment, just tie it to a single CPU and have your other processes that are related running/bound on the other cores so you can utilize them. OTOH, you say you are only using 30% of the one CPU, so it may not be a big deal to tie your single nic to one cpu.

Anyway, it seems that with more advanced firewalls/routers the kernel spends most of its time in IPSec/crypto code, netfilter conntrack and iptables rules/extensions, routing lookups, etc., and not in the hardware IRQ handler. So it would be nice if this part could be done by all CPUs.

Best regards, Krzysztof Olędzki
Re: gratuitous arp
On Sun, 26 Nov 2006, James Courtier-Dutton wrote: dean gaudet wrote: On Sun, 26 Nov 2006, James Courtier-Dutton wrote: dean gaudet wrote: hi... i ran into some problems recently which would have been avoided if my box did a gratuitous arp as it brought up all interfaces (the router took forever to timeout the ARP entries for interface aliases). so i set about looking to see why that wasn't happening. ... Are you 100% sure about this? Have you done a packet sniff on the network? A lot of routers ignore gratuitous arp for security reasons. yeah i've done some packet sniffing to verify this. here's what happened (twice now): i upgraded a (normally busy) box, so the MAC address changed. the router is a cisco (not managed by me). debian reboot sequence at some point brings up the primary eth0 address and very soon thereafter there will be an arp who-has $default_gw tell $primary_addr. that's sufficient to get the cisco to update its ARP cache for $primary_addr. this isn't gratuitous arp, but does the trick for the $primary_addr. but there's no gratuitous arp for any eth0:N aliased interfaces... and the cisco ARP cache on this ISP router seems to be set to a long timeout. i could reach eth0:N from local net, but couldn't get outside local net from eth0:N. issuing arping -I eth0 -s $secondary_addr $default_gw for each secondary address updated the cisco ARP cache and i could then reach eth0:N remotely. so... that may not be exactly gratuitous arp, but basically i was stuck until i forced the cisco to update its ARP cache for each of the secondary addrs... it seems to me it'd be nice for the init sequence to take care of this, so that other folks don't have to spend time debugging similar problems. i just wanted to ask if i'm missing something obvious before i go open a debian bug. (i'm tempted to see if fedora does anything differently.) thanks -dean Ok, I think it is better to just do gratuitous arp on the primary interface. 
If one starts doing it on secondary interfaces, one would then have to also do it for all proxy-arp addresses (if used), and things could start getting rather messy.

BTW: There is no such thing as secondary interfaces. What you use (ethX:X) is an emulation of interface aliases that was necessary for Linux 2.2.x, more than 5 years ago. Currently (2.4/2.6) it is possible to add many addresses to one interface; all you need is the iproute2 package and its utility called ip.

Best regards, Krzysztof Olędzki
Re: Zero checksum in netconsole/netdump packets
On Tue, 7 Nov 2006, Gerrit Renker wrote:
> Quoting Chris Lalancette:
> | Hello,
> | I realized that all of the packets that go from the crashing machine to the netdump server have a zero checksum.
> <snip>
> | Assuming that this is just an oversight, attached is a simple patch to compute the UDP checksum in netpoll_send_udp.
> |
> | Signed-off-by: Chris Lalancette [EMAIL PROTECTED]
> |
> RFC 768 allows to not compute the checksum by leaving uh->check at 0 - hence it is not illegal.

BTW: leaving the UDP checksum at 0 is only valid for IPv4; with IPv6 we _have to_ compute a checksum.

Best regards, Krzysztof Olędzki
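Editorial note: the rule discussed in this exchange can be written down as a small decision function. This is a sketch of the rule as stated in the thread plus the RFC 768 convention that a computed checksum of zero is transmitted as all ones; it is not the netpoll code, and the function name and parameters are hypothetical.

```python
from typing import Optional


def udp_checksum_field(computed: Optional[int], ipv6: bool) -> int:
    """Return the value to place in the UDP checksum field.

    `computed` is the 16-bit one's-complement checksum, or None when the
    sender chose not to compute one.
    """
    if computed is None:
        if ipv6:
            # As noted in the thread: with IPv6 the checksum is mandatory.
            raise ValueError("UDP over IPv6 requires a checksum")
        return 0  # IPv4: zero means "no checksum was computed"
    # A checksum that happens to compute to zero is sent as all ones,
    # so that zero keeps its "not computed" meaning.
    return computed if computed != 0 else 0xFFFF
```

With this reading, a zero field in an IPv4 netdump packet is legal, while the same packet over IPv6 would have to carry a real checksum.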
Re: [Bugme-new] [Bug 7421] New: Oops, EIP is at atalk_sendmsg
On Thu, 26 Oct 2006, Andrew Morton wrote: On Thu, 26 Oct 2006 04:08:36 -0700 [EMAIL PROTECTED] wrote: http://bugzilla.kernel.org/show_bug.cgi?id=7421 Summary: Oops, EIP is at atalk_sendmsg Kernel Version: 2.6.18.1 Status: NEW Severity: normal Owner: [EMAIL PROTECTED] Submitter: [EMAIL PROTECTED] Distribution: Debian sarge Hardware Environment: i386 Problem Description: ct 26 10:01:03 localhost papd[3120]: restart (2.0.3) Oct 26 10:01:07 localhost kernel: BUG: unable to handle kernel NULL pointer \ dereference at virtual address Oct 26 10:01:07 localhost kernel: printing eip: Oct 26 10:01:07 localhost kernel: d0c16a8a Oct 26 10:01:07 localhost kernel: *pde = Oct 26 10:01:07 localhost kernel: Oops: [#1] Oct 26 10:01:07 localhost kernel: Modules linked in: appletalk psnap llc ipv6 \ pcmcia_core af_packet parport_pc parport floppy pcspkr sn d_maestro3 snd_ac97_codec \ snd_ac97_bus snd_pcm snd_timer snd_page_alloc snd soundcore intel_agp uhci_hcd \ usbcore 3c59x mii agpgart mous edev tsdev joydev psmouse ide_cd cdrom rtc reiserfs \ ext3 jbd ide_disk ide_generic siimage aec62xx trm290 alim15x3 hpt34x hpt366 cmd64x \ piix rz1000 slc90e66 generic cs5530 cs5520 sc1200 triflex atiixp pdc202xx_old \ pdc202xx_new opti621 ns87415 cy82c693 amd74xx sis5513 via 82cxxx serverworks ide_core \ unix Oct 26 10:01:07 localhost kernel: CPU:0 Oct 26 10:01:07 localhost kernel: EIP:0060:[pg0+277633674/1070257152] Not \ tainted VLI Oct 26 10:01:07 localhost kernel: EFLAGS: 00010286 (2.6.17.14.2006-10-25 #1) Oct 26 10:01:07 localhost kernel: EIP is at atalk_sendmsg+0x15b/0x4e4 [appletalk] Oct 26 10:01:07 localhost kernel: eax: ebx: 002f ecx: \ edx: Oct 26 10:01:07 localhost kernel: esi: cadcb600 edi: ebp: cc9d7eec \ esp: cc9d7d6c Oct 26 10:01:07 localhost kernel: ds: 007b es: 007b ss: 0068 Oct 26 10:01:07 localhost kernel: Process afpd (pid: 3118, threadinfo=cc9d6000 \ task=cfe205d0) Oct 26 10:01:07 localhost kernel: Stack: c02b32c0 cc9d7ee8 cffbc500 \ d0c16f05 cffbc500 Oct 26 10:01:07 
localhost kernel:cffbc500 cc9d7ec8 cadcb600 \ 0400 cc9d7f48 001b Oct 26 10:01:07 localhost kernel:cc9d7ec8 cc9d7e1c cc9d7ee8 c01fe97a cc9d7e1c \ ca252600 cc9d7ec8 001b Oct 26 10:01:07 localhost kernel: Call Trace: Oct 26 10:01:07 localhost kernel: d0c16f05 atalk_recvmsg+0xf2/0x105 [appletalk] \ c01fe97a sock_sendmsg+0xd0/0xeb Oct 26 10:01:07 localhost kernel: c0157bfd touch_atime+0xb4/0xbb c0198b22 \ copy_from_user+0x34/0x5a Oct 26 10:01:07 localhost kernel: c012383e autoremove_wake_function+0x0/0x3a \ c0198b22 copy_from_user+0x34/0x5a Oct 26 10:01:07 localhost kernel: c01fe490 move_addr_to_kernel+0x24/0x39 \ c01ffaaa sys_sendto+0xe9/0x10d Oct 26 10:01:07 localhost kernel: c01fe67e sock_attach_fd+0x72/0xd2 c0143d52 \ get_empty_filp+0x3b/0xe4 Oct 26 10:01:07 localhost kernel: c0143d7b get_empty_filp+0x64/0xe4 c0198ae4 \ copy_to_user+0x32/0x3c Oct 26 10:01:07 localhost kernel: c02001de sys_socketcall+0xf2/0x180 c0102a03 \ syscall_call+0x7/0xb Oct 26 10:01:07 localhost kernel: Code: 0c 83 c0 04 eb 15 c6 44 24 1a 00 0f b7 86 26 \ 01 00 00 66 89 44 24 18 8d 44 24 18 50 e8 e0 eb ff ff 89 44 24 04 85 f6 5d 8b 14 24 \ 8b 12 89 54 24 04 74 1b 8b 86 84 00 00 00 f6 c4 04 74 10 52 53 Oct 26 10:01:07 localhost kernel: EIP: [pg0+277633674/1070257152] \ atalk_sendmsg+0x15b/0x4e4 [appletalk] SS:ESP 0068:cc9d7d6c Oct 26 10:01:21 localhost atalkd[3106]: as_timer gateway 8000.100 down Steps to reproduce: restart the machine, start papd after network initializing has finished a second start of papd works fine appletalk is loades as module same behaviour with 2.6.17.14 Something like me too: Unable to handle kernel NULL pointer dereference at virtual address printing eip: c036b1ef *pde = Oops: [#1] PREEMPT Modules linked in: bonding CPU:0 EIP:0060:[c036b1ef]Not tainted VLI EFLAGS: 00010286 (2.6.15.1) EIP is at atalk_sendmsg+0x158/0x557 eax: d468fee4 ebx: 0017 ecx: d468fd20 edx: esi: edi: d7e88200 ebp: bfa7c480 esp: d468fd68 ds: 007b es: 007b ss: 0068 Process atalkd (pid: 551, 
threadinfo=d468e000 task=d6f55090) Stack: d468ff40 d468fee0 d70d20a0 0003 c036b6e0 d70d20a0 d70d20a0 d468fec0 d7e88200 0400 d468ff40 0003 d468fec0 d468fe18 bfa7c480 c02e2d5e d468fe18 d7194540 d468fec0 0003 Call Trace: [c036b6e0] atalk_recvmsg+0xf2/0x105 [c02e2d5e] sock_sendmsg+0xce/0xe9 [c01212c2]
Probably e1000 related Oops in 2.6.18-rc5
Hello, My testing workstation running 2.6.18-rc5 Oopsed. It has a dual-port e1000 card with bonding and vlans. All I have is three photos taken with a digital camera: http://www.ans.pl/Oops/1/ I hope that is enough. Best regards, Krzysztof Olędzki
Re: PATCH Fix bonding active-backup behavior for VLAN interfaces
On Mon, 14 Aug 2006, David Miller wrote:
> From: Jay Vosburgh [EMAIL PROTECTED] Date: Thu, 03 Aug 2006 18:01:35 -0700
>> In this case (bond0.555 above bond0 above eth0,eth1,etc), skb_bond doesn't suppress duplicates because skb_bond is called with the skb->dev set to the bond0.555 dev, not the ethX dev. Non-accelerated VLAN devices don't do this; they'll come in with skb->dev set to ethX and will go through skb_bond as expected.
> Ok, since __vlan_hwaccel_rx() bypasses the netif_receive_skb() that would normally occur, we have to duplicate the bonding drop checks. The submitted patch put skb_bond() into if_vlan.h which is definitely the wrong thing to do. This is a generic operation and therefore belongs in linux/netdevice.h at best. Furthermore, we're only interested in the packet drop check, so that's the only part of the logic we need to export, the rest can stay private to skb_bond() in net/core/dev.c
> Can the folks who can reproduce this try this patch?

Works for me, thank you.

Acked-by: Krzysztof Piotr Oledzki [EMAIL PROTECTED]

Best regards, Krzysztof Olędzki
Re: PATCH Fix bonding active-backup behavior for VLAN interfaces
On Thu, 3 Aug 2006, Krzysztof Oledzki wrote:
> On Wed, 2 Aug 2006, David Miller wrote:
> CUT
>> Finally, I'm still a little stumped about why this change is necessary still, to be honest.
> If I understand it correctly, this patch fixes the "[PATCH] bonding: suppress duplicate packets" patch: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8f903c708fcc2b579ebf16542bf6109bad593a1d;hp=ebe19a4ed78d4a11a7e01cdeda25f91b7f2fcb5a
> It seems that the original patch does not work properly in a vlan accelerated environment, which I reported on 31 Mar 2006: http://marc.theaimsgroup.com/?l=bonding-devel&m=114381240718113&w=2
> Anyway, I didn't test this patch yet but I'm going to do it ASAP.

OK, this patch really solves the bug from my report. Is there any chance of a similar fix in net-2.6.19.git?

Best regards, Krzysztof Olędzki
Re: PATCH Fix bonding active-backup behavior for VLAN interfaces
On Wed, 2 Aug 2006, David Miller wrote:
CUT
> Finally, I'm still a little stumped about why this change is necessary still, to be honest.

If I understand it correctly, this patch fixes the "[PATCH] bonding: suppress duplicate packets" patch: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8f903c708fcc2b579ebf16542bf6109bad593a1d;hp=ebe19a4ed78d4a11a7e01cdeda25f91b7f2fcb5a

It seems that the original patch does not work properly in a vlan accelerated environment, which I reported on 31 Mar 2006: http://marc.theaimsgroup.com/?l=bonding-devel&m=114381240718113&w=2

Anyway, I didn't test this patch yet but I'm going to do it ASAP.

Best regards, Krzysztof Olędzki
Re: problems with e1000 and jumboframes
On Thu, 3 Aug 2006, Benjamin LaHaise wrote:
> On Thu, Aug 03, 2006 at 03:48:39PM +0200, Arnd Hannemann wrote:
>> However the box is a VIA Epia MII12000 with 1 GB of Ram and 1 GB of swap enabled, so there should be plenty of memory available. HIGHMEM support is off. The e1000 nic seems to be an 82540EM, which to my knowledge should support jumboframes. However I can't always reproduce this on a freshly booted system, so someone else may be the culprit and leaking pages? Any ideas how to debug this?
> This is memory fragmentation, and all you can do is work around it until the e1000 driver is changed to split jumbo frames up on rx. Here are a few ideas that should improve things for you:
> - switch to a 2GB/2GB split to recover the memory lost to highmem (see Processor Type and Features / Memory split)

With 1 GB of RAM full 1GB/3GB (CONFIG_VMSPLIT_3G_OPT) seems to be enough...

> - increase /proc/sys/vm/min_free_kbytes -- more free memory will improve the odds that enough unfragmented memory is available for incoming network packets

True. IMO, 65535 is a good starting point.

Best regards, Krzysztof Olędzki
Re: problems with e1000 and jumboframes
On Thu, 3 Aug 2006, Benjamin LaHaise wrote:
> On Thu, Aug 03, 2006 at 04:49:15PM +0200, Krzysztof Oledzki wrote:
>> With 1 GB of RAM full 1GB/3GB (CONFIG_VMSPLIT_3G_OPT) seems to be enough...
> Nope, you lose ~128MB of RAM for vmalloc space.

Not sure:

Linux version 2.6.17.7 ([EMAIL PROTECTED]) (gcc version 3.4.6) #1 SMP PREEMPT Fri Jul 28 18:05:40 CEST 2006
BIOS-provided physical RAM map:
BIOS-e820: - 000a (usable)
BIOS-e820: 0010 - 3ffc (usable)
BIOS-e820: 3ffc - 3ffcfc00 (ACPI data)
BIOS-e820: 3ffcfc00 - 3000 (reserved)
BIOS-e820: e000 - f000 (reserved)
BIOS-e820: fec0 - fec9 (reserved)
BIOS-e820: fed0 - fed00400 (reserved)
BIOS-e820: fee0 - fee1 (reserved)
BIOS-e820: ffb0 - 0001 (reserved)
1023MB LOWMEM available.
found SMP MP-table at 000fe710
On node 0 totalpages: 262080
DMA zone: 4096 pages, LIFO batch:0
Normal zone: 257984 pages, LIFO batch:31
(...)

$ zcat /proc/config.gz |grep VMSPLIT
# CONFIG_VMSPLIT_3G is not set
CONFIG_VMSPLIT_3G_OPT=y
# CONFIG_VMSPLIT_2G is not set
# CONFIG_VMSPLIT_1G is not set

Best regards, Krzysztof Olędzki
Re: problems with e1000 and jumboframes
On Thu, 3 Aug 2006, Evgeniy Polyakov wrote:
CUT
>> Why? After your explanation that makes sense for me. The driver needs one contiguous chunk for those 9k packet buffer and thus requests a 3-order page of 16k. Or do i still do not understand this?
> Correct, except that it wants 32k. e1000 logic is following: align frame size to power-of-two, 16K? then skb_alloc adds a little (sizeof(struct skb_shared_info)) at the end, and this ends up in 32k request just for 9k jumbo frame.

Strange, why can't this skb_shared_info be added before the first alignment? And what about smaller frames like 1500: does this driver behave similarly (first align, then add)?

Best regards, Krzysztof Olędzki
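Editorial note: the arithmetic behind the "32k for a 9k frame" remark can be worked through with a few lines. This is a simplified model, not the e1000 or skb allocation code: it assumes 4 KiB pages, an allocator that rounds every request up to a power of two, a 9018-byte frame (9000-byte MTU plus Ethernet header and FCS) and a 160-byte stand-in for sizeof(struct skb_shared_info). All of those numbers and names are assumptions for illustration.

```python
PAGE_SIZE = 4096  # assumed page size


def round_up_pow2(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def rx_alloc(frame_size: int, shared_info_size: int, align_first: bool = True):
    """Return (allocation size in bytes, page order) for one receive buffer.

    align_first=True models "align the frame size to a power of two, then
    append the shared info"; align_first=False models appending first.
    """
    if align_first:
        request = round_up_pow2(frame_size) + shared_info_size
    else:
        request = frame_size + shared_info_size
    alloc = round_up_pow2(request)  # simplified power-of-two allocator
    order = max(0, (alloc // PAGE_SIZE).bit_length() - 1)
    return alloc, order


if __name__ == "__main__":
    print(rx_alloc(9018, 160))                     # align, then add
    print(rx_alloc(9018, 160, align_first=False))  # add, then align
    print(rx_alloc(1518, 160))                     # standard 1500-byte MTU
```

In this model the jumbo frame needs 32768 bytes (order 3) when the alignment happens first and 16384 bytes (order 2) when the shared info is added before aligning, while a 1518-byte frame fits in a single page either way. That matches the shape of the question asked above, but the real driver and allocator details may differ.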
Re: skge driver oops
On Sun, 23 Jul 2006, Krzysztof Oledzki wrote: On Fri, 26 May 2006, Stephen Hemminger wrote: Please give this a try, it rearranges the transmit buffer management, and may avoid issues with partial completions causing SKB reuse. CUT Plase excuse me, I overlooked this patch. Anyway, it seems that this fix went into the 2.6.16 kernel, which is already on the server that caused problems (http://bugzilla.kernel.org/show_bug.cgi?id=6142). I'll disable my workaround (/usr/sbin/ethtool -K eth1 tx off) and let you known about the results. Strange, I had reenabled tx csum and there were no problems for about one week. Yesterday I had upgraded my kernel to the 2.6.17.7 and after one day, about 3 hours ago, my system crashed with following log: 782b6fe4 skge_xmit_frame+0x121/0x2ea 781249b6 raise_softirq_irqoff+0xe/0x59 7833b9b7 qdisc_restart+0xc4/0x16b 78332352 net_tx_action+0x97/0xbd 7812484d __do_softirq+0x59/0xc0 781248e4 do_softirq+0x30/0x35 78124947 local_bh_enable+0x5e/0x7e 78332194 dev_queue_xmit+0x1b6/0x1bd 7834ab2c ip_output+0x1b5/0x1eb 7834af00 ip_queue_xmit+0x39e/0x3e6 78191f3e __ext3_get_inode_loc+0x53/0x201 7819df94 journal_dirty_metadata+0x1d1/0x1eb 7811bafb __wake_up+0x27/0x3b 7819e3dc journal_stop+0x1bd/0x1c9 781963d0 __ext3_journal_stop+0x19/0x37 78192b58 ext3_dirty_inode+0x5d/0x63 78359652 tcp_transmit_skb+0x38e/0x3af 7816d122 touch_atime+0x97/0x9d 7835a89c tcp_write_xmit+0x1ad/0x212 7835a924 __tcp_push_pending_frames+0x23/0x80 78352732 do_tcp_setsockopt+0x12e/0x2f3 7832cd3c sock_common_setsockopt+0x1e/0x22 7832ac7b sys_setsockopt+0x61/0x81 7832b242 sys_socketcall+0x164/0x1a4 7815765d sys_sendfile+0x5d/0x84 78102c93 sysenter_past_esp+0x54/0x75 Bad page state in process 'swapper' page:7985eb20 flags:0x80010008 mapping:e25867a0 mapcount:0 count:0 Trying to fix it up, but a reboot is needed Backtrace: 78140e43 bad_page+0x43/0x6c 781415e5 free_hot_cold_page+0x5b/0x123 7832d700 skb_release_data+0x50/0x86 7832d741 kfree_skbmem+0xb/0x70 78355b41 
tcp_clean_rtx_queue+0x225/0x3e6 783560b1 tcp_ack+0x151/0x27b 78358116 tcp_rcv_established+0x544/0x5ed 7835e972 tcp_v4_do_rcv+0x1f/0xb4 7835ee8e tcp_v4_rcv+0x487/0x6de 7833f4ef nf_hook_slow+0xb3/0xce 78347aac ip_local_deliver+0x11b/0x1ab 78348086 ip_rcv+0x40c/0x446 783324e7 netif_receive_skb+0x16f/0x1a7 782b79a0 skge_poll+0x307/0x3e8 78332661 net_rx_action+0x5c/0xd3 7812484d __do_softirq+0x59/0xc0 781248e4 do_softirq+0x30/0x35 7812499d irq_exit+0x36/0x41 78104edc do_IRQ+0x20/0x28 7810101c default_idle+0x0/0x55 7810373e common_interrupt+0x1a/0x20 7810101c default_idle+0x0/0x55 78101048 default_idle+0x2c/0x55 78101132 cpu_idle+0xad/0xda

I know it is incomplete (this is all I was able to find in my logs) but it looks _very_ similar to the one from: http://bugzilla.kernel.org/show_bug.cgi?id=6142

BTW: During normal operation the skge driver still logs (about 10 times per hour) information about a hardware error. However, the message changed slightly. In 2.6.16 it was:
skge hardware error detected (status 0x400)
but in 2.6.17 it is:
skge :00:0b.0: PCI error cmd=0x7 status=0x82b0
skge :00:0b.0: PCI error cmd=0x147 status=0xc2b0
skge :00:0b.0: PCI error cmd=0x147 status=0xc2b0
skge :00:0b.0: PCI error cmd=0x147 status=0xc2b0
skge :00:0b.0: PCI error cmd=0x147 status=0xc2b0
skge :00:0b.0: PCI error cmd=0x147 status=0xc2b0
(...)

Anyway, everything works fine. I don't know if it is somehow related to the mentioned crashes.

Best regards, Krzysztof Oledzki
Re: skge driver oops
On Fri, 26 May 2006, Stephen Hemminger wrote:
> Please give this a try, it rearranges the transmit buffer management, and may avoid issues with partial completions causing SKB reuse.
CUT

Please excuse me, I overlooked this patch. Anyway, it seems that this fix went into the 2.6.16 kernel, which is already on the server that caused problems (http://bugzilla.kernel.org/show_bug.cgi?id=6142). I'll disable my workaround (/usr/sbin/ethtool -K eth1 tx off) and let you know about the results.

Thank you.

Best regards, Krzysztof Olędzki
Re: [e1000]: flow control on by default - good idea really?
On Wed, 5 Jul 2006, Auke Kok wrote: jamal wrote: On Tue, 2006-04-07 at 13:11 -0400, jamal wrote: I have a device connected to a e1000 that was erroneously advertising both tx/rx flow control but wasnt properly reacting to it. The default setup on the e1000 has rx flow control turned on. I was sending at wire rate gige from the device - which is about 1.48Mpps. The e1000 was in turn sending me flow control packets as per default/expected behavior. Unfortunately, it was sending a very large amount of packets. At one point i was seeing upto 1Mpps and on average, the flow control packets were consuming 60-70% of the bandwidth. Even when i fixed this behavior to act properly, allowing flow control on consumed up to 15% of the bandwidth. Clearly, this is a bad thing. Yes, the device in the first instance was at fault. But i have argued in the past that NAPI does just fine without flow control being turned on, so even chewing 5% of bandwidth on flow control is a bad thing.. As a compromise, can we declare flow control as an advanced feature and turn it off by default? People who feel it is valuable and know what they are doing can turn it off. I meant turn it on. BTW, As an addendum this default behavior changed around 2.6.16 it seems. Flow Control is using the EEPROM provided value, the module driver itself does not choose a default: e1000_param.c: /* User Specified Flow Control Override * * Valid Range: 0-3 * - 0 - No Flow Control * - 1 - Rx only, respond to PAUSE frames but do not generate them * - 2 - Tx only, generate PAUSE frames but ignore them on receive * - 3 - Full Flow Control Support * * Default Value: Read flow control settings from the EEPROM */ Turning flow control off usually (i.e. almost always) causes (significantly) _degraded_ performance. We should really leave it the way it is (as per eeprom setting), and this is best for most if not all people. 
The card itself has this value programmed, which makes it possible for the user to turn on/off flowcontrol per card consistently, which makes much more sense to me. Also considering e1000 hardware varies significantly. I was never able to find such tool for Linux or at least DOS. Where should I look for it? Best regards, Krzysztof Olędzki
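Editorial note: the 0-3 range in the driver comment quoted in this message reads naturally as a two-bit mask, with bit 0 for responding to PAUSE frames (Rx) and bit 1 for generating them (Tx). The sketch below only decodes the values as that comment describes them; the function and key names are hypothetical and nothing here is taken from the driver source.

```python
def flow_control_modes(value: int) -> dict:
    """Decode a 0-3 flow control setting as described in the quoted comment."""
    if not 0 <= value <= 3:
        raise ValueError("valid range is 0-3")
    return {
        "respond_to_pause": bool(value & 1),  # Rx: honor received PAUSE frames
        "generate_pause": bool(value & 2),    # Tx: send PAUSE frames
    }


if __name__ == "__main__":
    for v in range(4):
        print(v, flow_control_modes(v))
```

Read this way, the behavior jamal objects to (the NIC emitting large numbers of PAUSE frames) corresponds to the "generate" bit, which is set in modes 2 and 3.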
Re: [e1000]: flow control on by default - good idea really?
On Wed, 5 Jul 2006, Auke Kok wrote:
> David Miller wrote:
>> From: jamal [EMAIL PROTECTED] Date: Tue, 04 Jul 2006 15:20:39 -0400
>>> BTW, As an addendum this default behavior changed around 2.6.16 it seems.
>> Flow control has been on by default in the tg3 driver since the beginning, maybe e1000 only recently started to behave that way but it's the right thing to do IMHO.
> As said earlier, e1000 always honors the EEPROM setting for this, which has been _on_ by default for all cards (AFAIK, that is).

I'm not sure:

[EMAIL PROTECTED]:~# mii-tool -v eth0
eth0: negotiated 100baseTx-FD, link ok
product info: vendor 00:aa:00, model 56 rev 0
basic mode: autonegotiation enabled
basic status: autonegotiation complete, link ok
capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
advertising: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
[EMAIL PROTECTED]:~# ethtool -d eth0|grep flow
Receive flow control: disabled
Transmit flow control: disabled
[EMAIL PROTECTED]:~# uname -r
2.6.14.3

[EMAIL PROTECTED]:~# mii-tool -v eth0
eth0: negotiated 100baseTx-FD flow-control, link ok
product info: vendor 00:aa:00, model 56 rev 0
basic mode: autonegotiation enabled
basic status: autonegotiation complete, link ok
capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
advertising: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
[EMAIL PROTECTED]:~# ethtool -d eth0|grep flow
Receive flow control: enabled
Transmit flow control: enabled
[EMAIL PROTECTED]:~# uname -r
2.6.16.19

This is exactly the same hardware, only kernel was recently upgraded on the r2.

Best regards, Krzysztof Olędzki
Re: [PATCH UPDATE netdev-2.6.git] bonding: suppress duplicate packets
On Fri, 31 Mar 2006, Jay Vosburgh wrote: Krzysztof Oledzki [EMAIL PROTECTED] wrote: [...] I took this patch from linux-2.6 using git tree and applied to 2.6.16.1 together with recent link status fix. Unfortunately broadcast packet duplication still occurs. I am unable to induce any duplicate packets using the current netdev-2.6.git upstream branch (which should be the same bonding driver as you're using). I tried it with and without VLANs, using ping to various addresses (unicast, subnet broadcast, all-1s broadcast). I'm using a Cisco switch, and I'm issuing the IOS command clear mac address-table dynamic to induce it to (briefly) flood traffic to all ports. The only duplicates I see are ping pointing out duplicate returns from the multiple stations on the network. I don't see bonding delivering two copies of the same packet. Using the unmodified 2.6.16.1 kernel, I do see multiple copies of the same packet from a ping to the broadcast address using the method I describe above. Thank you for your tests and fast response. I am using a different network device (tg3), although I'm not sure how that would affect this. Probably this is not releated. Under what circumstances are you seeing duplicates, and what type of traffic is it? If I set net.ipv4.icmp_echo_ignore_broadcasts=0 I'm seeing duplicates while pinging broadcast address: # ping 192.168.149.255 -b WARNING: pinging broadcast address PING 192.168.149.255 (192.168.149.255) 56(84) bytes of data. 64 bytes from 192.168.149.21: icmp_seq=1 ttl=128 time=0.159 ms 64 bytes from 192.168.149.2: icmp_seq=1 ttl=64 time=0.267 ms (DUP!) 64 bytes from 192.168.149.11: icmp_seq=1 ttl=128 time=0.279 ms (DUP!) 64 bytes from 192.168.149.2: icmp_seq=1 ttl=64 time=0.288 ms (DUP!) 64 bytes from 192.168.149.10: icmp_seq=1 ttl=128 time=0.295 ms (DUP!) Please notice that 192.168.149.2 responded two times. 
If I run tcpdump on 192.168.149.2 it shows: [EMAIL PROTECTED]:~# tcpdump -i vlan19 -n icmp -e tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on vlan19, link-type EN10MB (Ethernet), capture size 96 bytes 15:41:07.512007 00:14:22:b0:cb:52 ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 98: 192.168.149.3 192.168.149.255: ICMP echo request, id 27686, seq 1, length 64 15:41:07.512111 00:14:22:b0:c9:f9 00:14:22:b0:cb:52, ethertype IPv4 (0x0800), length 98: 192.168.149.2 192.168.149.3: ICMP echo reply, id 27686, seq 1, length 64 15:41:07.512139 00:14:22:b0:cb:52 ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 98: 192.168.149.3 192.168.149.255: ICMP echo request, id 27686, seq 1, length 64 15:41:07.512160 00:14:22:b0:c9:f9 00:14:22:b0:cb:52, ethertype IPv4 (0x0800), length 98: 192.168.149.2 192.168.149.3: ICMP echo reply, id 27686, seq 1, length 64 So it seems I must have done something wrong but I have no idea what? Wrong patch? I'm using exactly this one: ftp://ftp.ans.pl/pub/patches/0140-bonding_suppress_duplicate_packets.patch Best regards, Krzysztof Olędzki
Re: [PATCH UPDATE netdev-2.6.git] bonding: suppress duplicate packets
On Tue, 21 Feb 2006, Jay Vosburgh wrote: Originally submitted by Kenzo Iwami; his original description is: The current bonding driver receives duplicate packets when broadcast/ multicast packets are sent by other devices or packets are flooded by the switch. In this patch, new flags are added in priv_flags of net_device structure to let the bonding driver discard duplicate packets in dev.c:skb_bond(). Modified by Jay Vosburgh to change a define name, update some comments, rearrange the new skb_bond() for clarity, clear all bonding priv_flags on slave release, and update the driver version. Signed-off-by: Kenzo Iwami [EMAIL PROTECTED] Signed-off-by: Jay Vosburgh [EMAIL PROTECTED] CUT I took this patch from linux-2.6 using git tree and applied to 2.6.16.1 together with recent link status fix. Unfortunately broadcast packet duplication still occurs. I have two e1000 NICs: 02:04.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05) 04:03.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05) My configuration follows: # echo -n 1 /sys/class/net/bond0/bonding/mode # echo -n 100 /sys/class/net/bond0/bonding/miimon # /sbin/ifconfig bond0 up # ifenslave bond0 eth0 eth1 # echo -n eth0 /sys/class/net/bond0/bonding/primary # cat /proc/net/bonding/bond0 Ethernet Channel Bonding Driver: v3.0.3 (March 23, 2006) Bonding Mode: fault-tolerance (active-backup) Primary Slave: eth0 Currently Active Slave: eth0 MII Status: up MII Polling Interval (ms): 100 Up Delay (ms): 0 Down Delay (ms): 0 Slave Interface: eth0 MII Status: up Link Failure Count: 7 Permanent HW addr: 00:14:22:b0:c9:f9 Slave Interface: eth1 MII Status: up Link Failure Count: 7 Permanent HW addr: 00:14:22:b0:c9:fa [EMAIL PROTECTED]:~# /sbin/ifconfig eth0 eth0 Link encap:Ethernet HWaddr 00:14:22:B0:C9:F9 inet6 addr: fe80::214:22ff:feb0:c9f9/64 Scope:Link UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 RX packets:612084 errors:0 dropped:0 
overruns:0 frame:0 TX packets:720804 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:65497223 (62.4 Mb) TX bytes:193356963 (184.3 Mb) Base address:0xecc0 Memory:fe9e-fea0 [EMAIL PROTECTED]:~# /sbin/ifconfig eth1 eth1 Link encap:Ethernet HWaddr 00:14:22:B0:C9:F9 inet6 addr: fe80::214:22ff:feb0:c9f9/64 Scope:Link UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 RX packets:85134 errors:0 dropped:0 overruns:0 frame:0 TX packets:1161 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:10730231 (10.2 Mb) TX bytes:93436 (91.2 Kb) Base address:0xdcc0 Memory:fe5e-fe60 I'm using .1Q vlans over bondig interface: # cat /proc/net/vlan/config VLAN Dev name| VLAN ID Name-Type: VLAN_NAME_TYPE_PLUS_VID_NO_PAD vlan1 | 1 | bond0 vlan2 | 2 | bond0 vlan3 | 3 | bond0 vlan4 | 4 | bond0 vlan5 | 5 | bond0 vlan6 | 6 | bond0 vlan7 | 7 | bond0 vlan18 | 18 | bond0 vlan19 | 19 | bond0 vlan33 | 33 | bond0 vlan34 | 34 | bond0 vlan37 | 37 | bond0 vlan66 | 66 | bond0 Any ideas? Best regards, Krzysztof Olędzki
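Editorial note: the idea in the patch description quoted in this message (mark slaves with a flag so that the receive path can discard the duplicate copies of flooded broadcast/multicast frames) can be modeled in a few lines. This is a toy model for an active-backup bond, not the skb_bond() code: the `inactive` field is a hypothetical stand-in for the per-device flag the patch adds, and the real drop check may have more conditions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Slave:
    name: str
    inactive: bool  # hypothetical stand-in for the per-device flag


def should_drop(slave: Slave) -> bool:
    """Drop frames that arrive on a slave that is not the active one."""
    return slave.inactive


def receive_flooded(frame_id: int, slaves: List[Slave]) -> List[Tuple[int, str]]:
    """Copies of one flooded frame that the bond passes up the stack.

    The switch floods the frame to every slave port; only copies that
    survive the drop check are delivered.
    """
    return [(frame_id, s.name) for s in slaves if not should_drop(s)]


if __name__ == "__main__":
    bond0 = [Slave("eth0", inactive=False), Slave("eth1", inactive=True)]
    print(receive_flooded(1, bond0))
```

In this model a broadcast flooded to both eth0 and eth1 is delivered once, from the active slave. The duplicates reported in this thread would correspond to a receive path that never reaches such a check, which is what the later VLAN acceleration discussion on this list points at.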
Re: [e1000 debug] KERNEL: assertion (!sk_forward_alloc) failed...
On Wed, 29 Mar 2006, Brandeburg, Jesse wrote: Hi all, I've identified you as people who have at some point in the past emailed one of the Linux lists with problems with e1000 and sk_forward_alloc. It seems to be fairly widespread, but only seems to have appeared with recent kernel changes (after 2.6.12...) What I need from you is a reproducible test, and some information. I have never been able to reproduce this, and I'm trying to isolate the problem a bit. What motherboards are you using? RIOWORKS/PDRCA. # lspci 00:00.0 Host bridge: ServerWorks GCNB-LE Host Bridge (rev 32) 00:00.1 Host bridge: ServerWorks GCNB-LE Host Bridge 00:02.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27) 00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02) 00:04.0 I2O: Adaptec (formerly DPT) SmartRAID V Controller (rev 01) 00:04.1 PCI bridge: Adaptec (formerly DPT) PCI Bridge (rev 01) 00:06.0 Ethernet controller: D-Link System Inc DL2000-based Gigabit Ethernet (rev 0c) 00:08.0 SCSI storage controller: Initio Corporation INI-A100U2W (rev 01) 00:0e.0 IDE interface: ServerWorks CSB6 IDE Controller (rev a0) 00:0f.0 Host bridge: ServerWorks CSB6 South Bridge (rev a0) 00:0f.1 IDE interface: ServerWorks CSB6 RAID/IDE Controller (rev a0) 00:0f.2 USB Controller: ServerWorks CSB6 OHCI USB Controller (rev 05) 00:0f.3 ISA bridge: ServerWorks GCLE-2 Host Bridge What seems to cause this problem? I don't know, as this problem occurs only occasionally. Are you all using iptables? Yes, this is a www proxy server with -j REDIRECT. Are you all routing? In a way, as this is a transparent www proxy.
From the reports I assume none of you are using an 82571/2/3 (pci express) 00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02) Subsystem: Rioworks: Unknown device 3011 Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 169 Memory at d000 (32-bit, non-prefetchable) [size=128K] I/O ports at 2c00 [size=64] Capabilities: [dc] Power Management version 2 Capabilities: [e4] PCI-X non-bridge device. Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable- Thank you. Best regards, Krzysztof Olędzki
Re: [e1000 debug] KERNEL: assertion (!sk_forward_alloc) failed...
On Thu, 30 Mar 2006, Mark Nipper wrote: On 29 Mar 2006, Brandeburg, Jesse wrote: What I need from you is a reproducible test, and some information. I have never been able to reproduce this, and I'm trying to isolate the problem a bit. What motherboards are you using? What seems to cause this problem? Are you all using iptables? Are you all routing? From the reports I assume none of you are using an 82571/2/3 (pci express) Unfortunately, my problem machine is a remote, leased server, so I'd have to ask my provider for information on the motherboard. You can probably check this with the dmidecode tool. Best regards, Krzysztof Olędzki
Re: [e1000 debug] KERNEL: assertion (!sk_forward_alloc) failed...
On Thu, 30 Mar 2006, Phil Oester wrote: On 29 Mar 2006, Brandeburg, Jesse wrote: What I need from you is a reproducible test, and some information. From all the reports which have come in thus far, it seems everyone has >1 e1000. One person even reported that removing one of the two nics solved the problem for him. Does this help narrow down the search area? I have only one. Anyway, this message appears _very_ occasionally in my case. Best regards, Krzysztof Olędzki
Re: KERNEL: assertion (!sk->sk_forward_alloc) failed
On Fri, 10 Mar 2006, David S. Miller wrote: From: Ian McDonald [EMAIL PROTECTED] Date: Fri, 10 Feb 2006 08:37:48 +1300 On 2/10/06, Boris B. Zhmurov [EMAIL PROTECTED] wrote: Hello, Ian McDonald. On 09.02.2006 22:25 you said the following: Is it possible for you to download 2.6.16-rc2 or similar and see if it goes away? It would be better if I got only the patch that fixes that problem, not all of 2.6.16-rc2. Oops I didn't read Jesse's message earlier properly. That patch which probably fixed it is (from his message): I think the commit id that is missing from 2.6.14.X is fb5f5e6e0cebd574be737334671d1aa8f170d5f3 This patch is in the linux-2.6.14 stable tree, I just verified this. So it must be another problem: I had this message with 2.6.15.2: KERNEL: assertion (!sk->sk_forward_alloc) failed at net/core/stream.c (279) KERNEL: assertion (!sk->sk_forward_alloc) failed at net/ipv4/af_inet.c (148) Best regards, Krzysztof Olędzki
Re: Fw: [Bugme-new] [Bug 5946] New: KERNEL: assertion (!sk->sk_forward_alloc) failed at net/core/stream.c (279)
On Tue, 24 Jan 2006, Andrew Morton wrote: Begin forwarded message: Date: Tue, 24 Jan 2006 00:11:51 -0800 From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: [Bugme-new] [Bug 5946] New: KERNEL: assertion (!sk->sk_forward_alloc) failed at net/core/stream.c (279) http://bugzilla.kernel.org/show_bug.cgi?id=5946 Summary: KERNEL: assertion (!sk->sk_forward_alloc) failed at net/core/stream.c (279) KERNEL: assertion (!sk->sk_forward_alloc) failed at net/ipv4/af_inet.c (148) Kernel Version: 2.6.15.1 Status: NEW Severity: normal Owner: [EMAIL PROTECTED] Submitter: [EMAIL PROTECTED] Most recent kernel where this bug did not occur: 2.6.13 Distribution: Gentoo Hardware Environment: P4 3.2GHz 2x e1000 driver 4GB RAM 8 SCSI DISCS Software Environment: Squid Problem Description: dmesg shows : KERNEL: assertion (!sk->sk_forward_alloc) failed at net/core/stream.c (279) KERNEL: assertion (!sk->sk_forward_alloc) failed at net/ipv4/af_inet.c (148) I just found the same message in my logs, so me too. /var/log/old/syslog.19:Jan 4 16:20:57 bizon kernel: KERNEL: assertion (!sk->sk_forward_alloc) failed at net/core/stream.c (279) /var/log/old/syslog.19:Jan 4 16:20:58 bizon kernel: KERNEL: assertion (!sk->sk_forward_alloc) failed at net/ipv4/af_inet.c (148) This happened only once, with the 2.6.14.2 kernel. It is a dual Xeon server with HT (4 logical CPUs total) running Slackware (NPTL) with apache, mysql, squid, sendmail, amavis, clamav, spamassassin, pop3/imap (courier). # zcat /proc/config.gz |grep PRE # CONFIG_PREEMPT_NONE is not set # CONFIG_PREEMPT_VOLUNTARY is not set CONFIG_PREEMPT=y CONFIG_PREEMPT_BKL=y CONFIG_PREVENT_FIRMWARE_BUILD=y CONFIG_DEBUG_PREEMPT=y Best regards, Krzysztof Olędzki
Re: [Ipsec-tools-devel] Re: [PATCH]: Re: SA switchover
On Sun, 18 Dec 2005, David S. Miller wrote: From: David S. Miller [EMAIL PROTECTED] Date: Sun, 18 Dec 2005 13:20:19 -0800 (PST) From: Krzysztof Oledzki [EMAIL PROTECTED] Date: Sun, 18 Dec 2005 17:49:50 +0100 (CET) At 17:31:26 kernel executed the one from xfrm_state_add() (Ole #2) but it didn't help. :( Thanks for testing, I'll try to figure out what might be going on. Ok, xfrm_flush_bundles() isn't pruning the bundles because they still look valid. We fix this by adding a xfrm_flush_all_bundles() that doesn't do the validity check and simply flushes everything. Please give this new version of the patch a try, thanks. OK. With this patch kernel switches to new SA immediately, but only for ping. TCP (ssh) session between Cisco and Linux is still protected by the old SA. Tested by running two tests simultaneously: - while true ; do echo -ne . ; sleep 1; done over ssh - ping Both protected by the same ipsec policy. ssh: 10:21:58.376530 IP 192.168.0.24 192.168.0.7: ESP(spi=0x00648e34,seq=0x17c) 10:21:58.376856 IP 192.168.0.7 192.168.0.24: ESP(spi=0x1acf2fac,seq=0x17c) ping: 10:21:58.943229 IP 192.168.0.7 192.168.0.24: ESP(spi=0x1acf2fac,seq=0x17d) 10:21:58.947768 IP 192.168.0.24 192.168.0.7: ESP(spi=0x00648e34,seq=0x17d) ssh: 10:21:59.396334 IP 192.168.0.24 192.168.0.7: ESP(spi=0x00648e34,seq=0x17e) 10:21:59.396664 IP 192.168.0.7 192.168.0.24: ESP(spi=0x1acf2fac,seq=0x17e) ping: 10:21:59.944079 IP 192.168.0.7 192.168.0.24: ESP(spi=0x1acf2fac,seq=0x17f) 10:21:59.971934 IP 192.168.0.24 192.168.0.7: ESP(spi=0x00648e34,seq=0x17f) * New SA was negotiated: Dec 19 10:22:00 chochlik racoon: INFO: IPsec-SA established: ESP/Tunnel 192.168.0.24[0]-192.168.0.7[0] spi=228316027(0xd9bd37b) Dec 19 10:22:00 chochlik racoon: INFO: IPsec-SA established: ESP/Tunnel 192.168.0.7[0]-192.168.0.24[0] spi=3587656557(0xd5d74b6d) * Cisco switched to the new SA immediately, Linux switched only partially: ssh: 10:22:00.416215 IP 192.168.0.24 192.168.0.7: ESP(spi=0x0d9bd37b,seq=0x1) 10:22:00.416607 
IP 192.168.0.7 192.168.0.24: ESP(spi=0x1acf2fac,seq=0x180) ping: 10:22:00.944950 IP 192.168.0.7 192.168.0.24: ESP(spi=0xd5d74b6d,seq=0x1) 10:22:00.949622 IP 192.168.0.24 192.168.0.7: ESP(spi=0x0d9bd37b,seq=0x2) ssh: 10:22:01.436183 IP 192.168.0.24 192.168.0.7: ESP(spi=0x0d9bd37b,seq=0x3) 10:22:01.436523 IP 192.168.0.7 192.168.0.24: ESP(spi=0x1acf2fac,seq=0x181) ping: 10:22:01.945777 IP 192.168.0.7 192.168.0.24: ESP(spi=0xd5d74b6d,seq=0x2) 10:22:01.950323 IP 192.168.0.24 192.168.0.7: ESP(spi=0x0d9bd37b,seq=0x4) (...) * Executed ip route flush cache: ssh: 10:22:16.743559 IP 192.168.0.24 192.168.0.7: ESP(spi=0x0d9bd37b,seq=0x21) 10:22:16.744028 IP 192.168.0.7 192.168.0.24: ESP(spi=0xd5d74b6d,seq=0x11) ping: 10:22:16.959512 IP 192.168.0.7 192.168.0.24: ESP(spi=0xd5d74b6d,seq=0x12) 10:22:16.964147 IP 192.168.0.24 192.168.0.7: ESP(spi=0x0d9bd37b,seq=0x22) Best regards, Krzysztof Olędzki
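The partial switchover in the trace above can be spotted mechanically by tracking the SPI per direction. A minimal sketch, not part of the original thread: the function name `spi_changes` and the regular expression are assumptions made for illustration, and the `>` that tcpdump normally prints between the two addresses is treated as optional so the lines can be pasted as they appear here.

```python
import re

# Matches tcpdump ESP lines such as
#   "IP 192.168.0.7 > 192.168.0.24: ESP(spi=0x1acf2fac,seq=0x17e)"
# The "> " is optional (assumption: some archives strip it).
ESP_LINE = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+) (?:> )?(?P<dst>\d+\.\d+\.\d+\.\d+): "
    r"ESP\(spi=0x(?P<spi>[0-9a-fA-F]+),seq=0x(?P<seq>[0-9a-fA-F]+)\)"
)


def spi_changes(trace):
    """Return (src, dst, old_spi, new_spi) each time a direction changes SPI."""
    last_spi = {}   # (src, dst) -> SPI seen on the previous packet
    changes = []
    for match in ESP_LINE.finditer(trace):
        direction = (match.group("src"), match.group("dst"))
        spi = int(match.group("spi"), 16)
        previous = last_spi.get(direction)
        if previous is not None and previous != spi:
            changes.append((direction[0], direction[1], previous, spi))
        last_spi[direction] = spi
    return changes
```

A direction that keeps flipping between two SPIs, as 192.168.0.7 to 192.168.0.24 does after 10:22:00, is using both the old and the new SA at the same time; a clean switchover shows exactly one change per direction.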
Re: [PATCH]: Re: SA switchover
On Mon, 19 Dec 2005, jamal wrote: On Mon, 2005-19-12 at 13:57 -0800, David S. Miller wrote: From: jamal [EMAIL PROTECTED] Date: Mon, 19 Dec 2005 08:17:19 -0500 Just an addendum: If this works it should be sysctl controlled i hope. There is absolutely no reason for that, so no :) Well, we went from a "use old SA" to a "use new SA" policy ;-> No, we went from "use both new and old SA" to "always use the same (new) SA". Adding a sysctl to keep the kernel buggy is totally wrong. Best regards, Krzysztof Olędzki
Re: [Ipsec-tools-devel] Re: [PATCH]: Re: SA switchover
On Mon, 19 Dec 2005, David S. Miller wrote: From: Krzysztof Oledzki [EMAIL PROTECTED] Date: Mon, 19 Dec 2005 10:37:14 +0100 (CET) OK. With this patch kernel switches to new SA immediately, but only for ping. TCP (ssh) session between Cisco and Linux is still protected by the old SA. Ok, we're making progress :-) When the bundles get flushed, xfrm_prune_bundles() accumulates all the per-policy bundles into a list and runs dst_free() on each and every one. Unless marked obsolete already (these dst's should not be marked obsolete), it invokes __dst_free() which marks the dst as obsolete and this in turn should trigger the cached socket route check here in __sk_dst_check(). static inline struct dst_entry * __sk_dst_check(struct sock *sk, u32 cookie) { struct dst_entry *dst = sk->sk_dst_cache; if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) { sk->sk_dst_cache = NULL; dst_release(dst); return NULL; } return dst; } Oh, that's the bug, dst->ops->check() is xfrm_dst_check(). That tests validity using stable_bundle() which thinks the dst is still valid. Please add these two lines: if (dst->obsolete) return NULL; at the beginning of xfrm_dst_check() and all should be fine.
Yes, it works now perfectly: 06:19:09.363154 IP 192.168.0.24 192.168.0.7: ESP(spi=0x03456676,seq=0x145) 06:19:09.363548 IP 192.168.0.7 192.168.0.24: ESP(spi=0x4fd702b2,seq=0x166) 06:19:09.736632 IP 192.168.0.7 192.168.0.24: ESP(spi=0x4fd702b2,seq=0x167) 06:19:09.741256 IP 192.168.0.24 192.168.0.7: ESP(spi=0x03456676,seq=0x146) Dec 20 06:19:10 chochlik racoon: INFO: IPsec-SA established: ESP/Tunnel 192.168.0.24[0]-192.168.0.7[0] spi=72688259(0x4552283) Dec 20 06:19:10 chochlik racoon: INFO: IPsec-SA established: ESP/Tunnel 192.168.0.7[0]-192.168.0.24[0] spi=671780776(0x280a8fa8) 06:19:10.382903 IP 192.168.0.24 192.168.0.7: ESP(spi=0x04552283,seq=0x1) 06:19:10.383364 IP 192.168.0.7 192.168.0.24: ESP(spi=0x280a8fa8,seq=0x1) 06:19:10.737511 IP 192.168.0.7 192.168.0.24: ESP(spi=0x280a8fa8,seq=0x2) 06:19:10.742083 IP 192.168.0.24 192.168.0.7: ESP(spi=0x04552283,seq=0x2) Dziekuje bardzo for all of your testing so far Krzysztof. Dziekuje bardzo ;) Best regards, Krzysztof Olędzki
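The effect of the two added lines can be illustrated with a toy model. This is only a sketch of the logic described in the message above, not kernel code: the classes `Dst` and `Sock`, the `bundle_looks_stable` flag, and all function names are hypothetical stand-ins for the dst entry, the socket's cached route, and the validity test.

```python
class Dst:
    """Toy stand-in for a cached route entry (hypothetical, not the kernel struct)."""

    def __init__(self):
        self.obsolete = False
        # Models a validity test that still believes the bundle is fine.
        self.bundle_looks_stable = True


class Sock:
    """Toy socket that caches one route entry."""

    def __init__(self, dst):
        self.dst_cache = dst


def check_before_fix(dst):
    # Only consults the validity test; the obsolete mark is ignored.
    return dst if dst.bundle_looks_stable else None


def check_after_fix(dst):
    # The added early return: an obsolete entry is never reported as valid.
    if dst.obsolete:
        return None
    return dst if dst.bundle_looks_stable else None


def sk_dst_check(sock, check):
    """Return the cached entry, dropping it when it is obsolete and fails check()."""
    dst = sock.dst_cache
    if dst is not None and dst.obsolete and check(dst) is None:
        sock.dst_cache = None
        return None
    return dst
```

With `check_before_fix`, a socket whose cached entry has been marked obsolete keeps getting the same entry back, which matches the ssh session staying on the old SA; with `check_after_fix` the cache is cleared and the next lookup has to build a fresh route.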
Re: [Ipsec-tools-devel] Re: [PATCH]: Re: SA switchover
On Thu, 15 Dec 2005, David S. Miller wrote: From: David S. Miller [EMAIL PROTECTED] Date: Thu, 15 Dec 2005 17:52:54 -0800 (PST) diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 7cf48aa..25dd8f4 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c Sorry, that patch was incomplete, please try this one instead: It does not work. :( 192.168.0.7 - Linux 192.168.0.24 - Cisco Tested it by running ping directly from Linux IPSec gateway: 17:31:22.830181 IP 192.168.0.7 192.168.0.24: ESP(spi=0x4ca5896a,seq=0x57) 17:31:22.834761 IP 192.168.0.24 192.168.0.7: ESP(spi=0x0a91a2ae,seq=0x57) 17:31:23.830997 IP 192.168.0.7 192.168.0.24: ESP(spi=0x4ca5896a,seq=0x58) 17:31:23.835811 IP 192.168.0.24 192.168.0.7: ESP(spi=0x0a91a2ae,seq=0x58) 17:31:24.831855 IP 192.168.0.7 192.168.0.24: ESP(spi=0x4ca5896a,seq=0x59) 17:31:24.836430 IP 192.168.0.24 192.168.0.7: ESP(spi=0x0a91a2ae,seq=0x59) 17:31:25.832692 IP 192.168.0.7 192.168.0.24: ESP(spi=0x4ca5896a,seq=0x5a) 17:31:25.837190 IP 192.168.0.24 192.168.0.7: ESP(spi=0x0a91a2ae,seq=0x5a) New IPsec-SA was negotiated: Dec 18 17:31:26 chochlik racoon: INFO: respond new phase 2 negotiation: 192.168.0.7[0]=192.168.0.24[0] Dec 18 17:31:26 chochlik racoon: INFO: IPsec-SA established: ESP/Tunnel 192.168.0.24[0]-192.168.0.7[0] spi=132988380(0x7ed3ddc) Dec 18 17:31:26 chochlik racoon: INFO: IPsec-SA established: ESP/Tunnel 192.168.0.7[0]-192.168.0.24[0] spi=1929290090(0x72fea16a) Cisco switched to the new SA immediately: 17:31:26.833579 IP 192.168.0.7 192.168.0.24: ESP(spi=0x4ca5896a,seq=0x5b) 17:31:26.838184 IP 192.168.0.24 192.168.0.7: ESP(spi=0x07ed3ddc,seq=0x1) 17:31:27.834389 IP 192.168.0.7 192.168.0.24: ESP(spi=0x4ca5896a,seq=0x5c) 17:31:27.839044 IP 192.168.0.24 192.168.0.7: ESP(spi=0x07ed3ddc,seq=0x2) 17:31:28.835245 IP 192.168.0.7 192.168.0.24: ESP(spi=0x4ca5896a,seq=0x5d) 17:31:28.839843 IP 192.168.0.24 192.168.0.7: ESP(spi=0x07ed3ddc,seq=0x3) 17:31:29.836088 IP 192.168.0.7 192.168.0.24: ESP(spi=0x4ca5896a,seq=0x5e) 
17:31:29.840708 IP 192.168.0.24 192.168.0.7: ESP(spi=0x07ed3ddc,seq=0x4) Executed ip route flush cache, Linux switched to the new SA: 17:31:30.837009 IP 192.168.0.7 192.168.0.24: ESP(spi=0x72fea16a,seq=0x1) 17:31:30.841616 IP 192.168.0.24 192.168.0.7: ESP(spi=0x07ed3ddc,seq=0x5) 17:31:31.837779 IP 192.168.0.7 192.168.0.24: ESP(spi=0x72fea16a,seq=0x2) 17:31:31.842349 IP 192.168.0.24 192.168.0.7: ESP(spi=0x07ed3ddc,seq=0x6) 17:31:32.838647 IP 192.168.0.7 192.168.0.24: ESP(spi=0x72fea16a,seq=0x3) 17:31:32.843224 IP 192.168.0.24 192.168.0.7: ESP(spi=0x07ed3ddc,seq=0x7) 17:31:33.839475 IP 192.168.0.7 192.168.0.24: ESP(spi=0x72fea16a,seq=0x4) 17:31:33.985697 IP 192.168.0.24 192.168.0.7: ESP(spi=0x07ed3ddc,seq=0x8) (...) I also added two printks to check whether schedule_work is executed: diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 7cf48aa..f255e97 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -431,6 +431,9 @@ void xfrm_state_insert(struct xfrm_state spin_lock_bh(&xfrm_state_lock); __xfrm_state_insert(x); spin_unlock_bh(&xfrm_state_lock); + + printk("Ole #1\n"); + xfrm_state_gc_flush_bundles = 1; + schedule_work(&xfrm_state_gc_work); } EXPORT_SYMBOL(xfrm_state_insert); @@ -478,6 +481,11 @@ out: spin_unlock_bh(&xfrm_state_lock); xfrm_state_put_afinfo(afinfo); + if (err == 0) { + printk("Ole #2\n"); + xfrm_state_gc_flush_bundles = 1; + schedule_work(&xfrm_state_gc_work); + } + if (x1) { xfrm_state_delete(x1); xfrm_state_put(x1); At 17:31:26 the kernel executed the one from xfrm_state_add() (Ole #2), but it didn't help. :( Sorry it took me so long, but now I have everything ready, so I can run more tests. Best regards, Krzysztof Olędzki
Re: [PATCH]: Re: SA switchover
On Thu, 15 Dec 2005, David S. Miller wrote: From: David S. Miller [EMAIL PROTECTED] Date: Thu, 15 Dec 2005 17:52:54 -0800 (PST) diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 7cf48aa..25dd8f4 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c Sorry, that patch was incomplete, please try this one instead: diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 7cf48aa..f255e97 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c CUT Thank you! I will test it ASAP. I need a day or two to reassemble my IPSec netlab. ;) Best regards, Krzysztof Olędzki
Re: [Ipsec-tools-devel] Re: SA switchover
On Thu, 15 Dec 2005, jamal wrote: Agree. It is a _workaround_ ;-> A good one in my opinion. Given that it works for CISCOs, a very large piece of the problem is resolved, no? No, again, it does not help. I explained it in my previous mail. It will help 100% of the time _if you know_ you have CISCOs on the other end and you configure racoon with that in mind. In other words it doesn't matter who the initiator/responder is in this case. It does matter. This problem does not exist when cisco acts as responder, this problem does not exist when linksys acts as initiator. Do you disagree with this? Yes ;) Other people who have tried the patch don't seem to agree with your thesis. Not sure: -- Forwarded message BEGIN -- Date: Mon, 21 Nov 2005 15:31:54 +0100 From: [EMAIL PROTECTED] To: jamal [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: [Ipsec-tools-devel] Forcing SA soft limit? I could give you a patch that forces the soft limit. I have not tested it and have not seen interest from the racoon folks to incorporate it. Talk to me privately. Unfortunately setting the soft limit does not solve my problem. I tried recompiling from the sources setting my wanted soft limit. The problem turned out to be the peer (a cheapo DrayTek Vigor 2500 router) which discards the old SA before the hard limit expires and without agreeing the revoke with Racoon. -- Forwarded message END -- Very simple and accurate explanation. You are the one who opened the bug - have you tried the patch? ;-> It adds a very useful feature but does not solve my problem. Well, as strange as this may sound, actually it may not be that unreasonable to dynamically make the policy decision ;-> we know which devices have problems. Really? I don't think so. OK, let me ask you this: When you configure use new SA - are you making assumption about what is on the other end? in other words, you have knowledge of the end device to assume it will start accepting the new SA immediately. Sometimes yes, sometimes no. Generally: no.
Is there no way in racoon that someone will get the vendor id in an external C program or shell script and then they reconfigure IKE parameters for that peer only? If yes, then one could create a little script or C program that sets the softime for ciscos. But what if the same problem exists in other IPSec implementations that cannot be detected by vendor ID? then you will need to use the admin's brain as a last resort i.e. no different than you making the assumption that the other end is respecting use new SA. In any case - what we need to do is fix this issue and not argue semantics of the RFC. IMO, it's a screw-up in the RFC definition. True. I can accept any fix, as long as it is going to _solve_ the problem. For me, a kernel fix or a racoon fix is equally fine. But please notice that dirty workarounds are not going to fix this, and I already have one (echo -ne -1 > /proc/sys/net/ipv4/route/flush after negotiating each new SA). It works and it is ugly. Very ugly. ;) Best regards, Krzysztof Olędzki
Re: [Ipsec-tools-devel] Re: SA switchover
On Thu, 15 Dec 2005, David S. Miller wrote: 1) I don't understand how a routing cache flush fixes the problem. The routing cache flush only marks non-IPSEC cached routes as invalid, not IPSEC ones. The new IPsec SA is used for communication between a new (previously unseen) src/dst pair even if the old SA exists. Only communication between a src/dst pair that was previously active is stuck with the old SA. I was also surprised that a routing cache flush helps, but it really works, and I have used this workaround for more than three months. It looks like XFRM caches that information so that the kernel does not need to search the whole SADB for each packet, and this is the reason why usage of the old SA is observed. This is only my theory; someone who wrote XFRM probably knows this for sure. Best regards, Krzysztof Olędzki
Re: [PATCH 1/1] [NETFILTER] ip_conntrack: fix ftp/irc/tftp helpers on ports >= 32768
On Thu, 17 Nov 2005, David S. Miller wrote: From: Harald Welte [EMAIL PROTECTED] Date: Tue, 15 Nov 2005 11:03:51 +0100 [NETFILTER] ip_conntrack: fix ftp/irc/tftp helpers on ports >= 32768 Since we've converted the ftp/irc/tftp helpers to use the new module_parm_array() some time ago, we were accidentally using signed data types - thus preventing those modules from being used on ports >= 32768. This patch fixes it by using 'ushort' module parameters. Thanks to Jan Nijs for reporting this bug. Signed-off-by: Harald Welte [EMAIL PROTECTED] Applied, thanks. I think this is definitely a 2.6.14-stable candidate? What about the patch that fixes vlan with bonding? http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8e3babcd69ec0fde874838e276eb0b211c6a5647 I think we should fix this in 2.6.14-stable also. Best regards, Krzysztof Olędzki
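The 32768 boundary in the patch description is simply the upper limit of a signed 16-bit integer. A small sketch of the range check, assuming nothing about how the kernel's parameter parser reports the failure (the thread does not show that); `fits_16bit` is a hypothetical helper name.

```python
import struct


def fits_16bit(value, signed):
    """Return True if value is representable as a 16-bit short / unsigned short."""
    fmt = "=h" if signed else "=H"   # standard-size signed / unsigned short
    try:
        struct.pack(fmt, value)
    except struct.error:
        return False
    return True
```

A signed short tops out at 32767, so a helper port such as 32768 only fits once the parameter type is unsigned, which covers the full 0 to 65535 port range.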
Re: [PATCH 1/1] [NETFILTER] ip_conntrack: fix ftp/irc/tftp helpers on ports >= 32768
On Fri, 18 Nov 2005, David S. Miller wrote: From: Krzysztof Oledzki [EMAIL PROTECTED] Date: Fri, 18 Nov 2005 09:43:27 +0100 (CET) What about patch that fixes vlan with bonding? http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8e3babcd69ec0fde874838e276eb0b211c6a5647 I think we should fix this in 2.6.14-stable also. That's network device stuff, ask Jeff Garzik and the patch submitter, not me. I had already done that (15 Nov 2005), but it seems I was ignored. OK, please excuse me. Best regards, Krzysztof Olędzki
Re: [PATCH 2.6.14] bonding: fix feature consolidation
On Fri, 4 Nov 2005, Jay Vosburgh wrote: This should resolve http://bugzilla.kernel.org/show_bug.cgi?id=5519 The current feature computation loses bits that it doesn't know about, resulting in an inability to add VLANs and possibly other havoc. Rewrote function to preserve bits it doesn't know about, remove an unneeded state variable, and simplify the code. -J Could we have this fix in the next -stable release for 2.6.14, please? Best regards, Krzysztof Olędzki
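The difference between losing and preserving unknown bits can be shown with plain bit masks. This is a sketch of the idea only, not the bonding driver's function: the flag values `F_SG`, `F_CSUM`, and `F_VLAN` are made up for illustration and do not correspond to the kernel's feature constants.

```python
# Hypothetical feature bits, chosen only for this example.
F_SG = 0x001     # known to the consolidation code
F_CSUM = 0x002   # known to the consolidation code
F_VLAN = 0x100   # set on the bond, unknown to the consolidation code

KNOWN_MASK = F_SG | F_CSUM


def consolidate_lossy(bond_features, slave_features):
    """Intersect the known bits across slaves and drop everything else."""
    common = KNOWN_MASK
    for features in slave_features:
        common &= features
    return common


def consolidate_preserving(bond_features, slave_features):
    """Intersect the known bits across slaves, keep the bond's other bits as-is."""
    common = KNOWN_MASK
    for features in slave_features:
        common &= features
    return (bond_features & ~KNOWN_MASK) | common
```

In the lossy version a bond that started with `F_VLAN` set comes out without it, which is the kind of loss the patch description blames for VLANs no longer being addable; the preserving version only recomputes the bits it understands.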
Re: [PATCH 2.6.14 0/18] Yet Another Bonding Sysfs patchset
On Wed, 9 Nov 2005, Mitch Williams wrote: Jay says he's finally ready to take this patch. So here we go again. Rebased against 2.6.14 final. Which turned out to be way more work than I expected. Is this patchset going to be included in 2.6.15? Best regards, Krzysztof Olędzki
Re: Fw: [Bugme-new] [Bug 5194] New: IPSec related OOps in 2.6.13
On Tue, 6 Sep 2005, Herbert Xu wrote: On Tue, Sep 06, 2005 at 04:08:56AM -0700, Andrew Morton wrote: Problem Description: Oops: [#1] PREEMPT Modules linked in: CPU:0 EIP:0060:[c01f562c]Not tainted VLI EFLAGS: 00010216 (2.6.13) EIP is at sha1_update+0x7c/0x160 Thanks for the report. Matt LaPlante had exactly the same problem a couple of days ago. I've now tracked it down to my broken crypto cipher wrapper functions, which will step over a page boundary if it's not aligned correctly. [CRYPTO] Fix boundary check in standard multi-block cipher processors Thanks. I patched my kernel, recompiled, and am now waiting. So far it is OK. Should this patch be merged into 2.6.13.1? Best regards, Krzysztof Olędzki
Re: Fw: Re: [Bugme-new] [Bug 4952] New: IPSec incompabilty. Linux kernel waits to long to start using new SA for outbound traffic.
On Tue, 2 Aug 2005, Patrick McHardy wrote: Krzysztof Oledzki wrote: On Mon, 1 Aug 2005, Herbert Xu wrote: On Mon, Aug 01, 2005 at 05:46:26AM +0200, Krzysztof Oledzki wrote: Any new patches to test? ;) As I said in an earlier message, you should patch racoon to delete the old *outbound* SA when the new SA has been negotiated. Did not receive this one, sorry :(. However, the same question was put to the racoon developers, and the answer was that it is the kernel's job. They even pointed out that the KAME IPSec stack can be tuned to prefer (or not to prefer) the old SA. The kernel's job is to use a valid SA. Again... RFC 2408 says: A protocol implementation SHOULD begin using the newly created SA for outbound traffic and SHOULD continue to support incoming traffic on the old SA until it is deleted or until traffic is received under the protection of the newly created SA. - Section 4.3. In this case both are valid and the peer is buggy. The problem is the word SHOULD and IMHO both Linux and the peer are buggy. So I think the suggestion to work around this in the keying daemons is not unreasonable. There is no need to work around this on *BSD (KAME stack) and the keying daemon is exactly the same for both Linux and *BSD. Best regards, Krzysztof Olędzki
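The behaviour that the quoted RFC 2408 text recommends can be written down as two small rules. The following is a simplified sketch under stated assumptions, not how Linux or KAME select SAs: the `SA` record and both function names are hypothetical, and the "until traffic is received under the protection of the newly created SA" condition is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class SA:
    """Minimal security association record for illustration only."""
    spi: int
    created: float        # negotiation time, e.g. seconds since start
    deleted: bool = False


def pick_outbound(outbound_sas):
    """Outbound rule: use the most recently created SA that is still present."""
    live = [sa for sa in outbound_sas if not sa.deleted]
    if not live:
        return None
    return max(live, key=lambda sa: sa.created)


def accept_inbound(inbound_sas, spi):
    """Inbound rule: keep accepting any SA, old or new, until it is deleted."""
    return any(sa.spi == spi and not sa.deleted for sa in inbound_sas)
```

Under these rules a peer that drops its old inbound SA early, as the Cisco in this thread does, still interoperates, because the sender moves to the new SA as soon as it exists.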
Re: [Bugme-new] [Bug 4952] New: IPSec incompabilty. Linux kernel waits to long to start using new SA for outbound traffic.
On Tue, 2 Aug 2005, Herbert Xu wrote: On Mon, Aug 01, 2005 at 10:41:33AM +0200, Krzysztof Oledzki wrote: RFC 2408 says: A protocol implementation SHOULD begin using the newly created SA for outbound traffic and SHOULD continue to support incoming traffic on the old SA until it is deleted or until traffic is received under the protection of the newly created SA. - Section 4.3. The problem is the word SHOULD and IMHO both Linux and peer are buggy. The protocol implementation is made up of a kernel component as well as a user-space component. IMHO this should be done where it's easiest. IMHO userland is not supposed to solve kernel issues. Best regards, Krzysztof Olędzki