Re: e1000: Detected Tx Unit Hang
Bernd Schubert wrote:
> On Saturday 16 February 2008, Kok, Auke wrote:
>> Bernd Schubert wrote:
>>> Hello,
>>>
>>> I can't login to one of our servers and just got this in an ipmi sol session:
>>>
>>> [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>>> [18169.209183]   Tx Queue             <0>
>>> [18169.209184]   TDH                  <e3>
>>> [18169.209185]   TDT                  <e3>
>>> [18169.209186]   next_to_use          <e3>
>>> [18169.209187]   next_to_clean        <bd>
>>> [18169.209188] buffer_info[next_to_clean]
>>> [18169.209189]   time_stamp           <10043e4d2>
>>> [18169.209190]   next_to_watch        <be>
>>> [18169.209191]   jiffies              <10043e6f6>
>>> [18169.209192]   next_to_watch.status <1>
>>> [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
>>> [18169.256979]   Tx Queue             <0>
>>> [18169.256980]   TDH                  <de>
>>> [18169.256982]   TDT                  <de>
>>> [18169.256983]   next_to_use          <de>
>>> [18169.256984]   next_to_clean        <bc>
>>> [18169.256985] buffer_info[next_to_clean]
>>> [18169.256986]   time_stamp           <10043e511>
>>> [18169.256987]   next_to_watch        <bd>
>>> [18169.256988]   jiffies              <10043e701>
>>> [18169.256989]   next_to_watch.status <1>
>>>
>>> This is with 2.6.22.18. Is there any chance to recover the system? For some
>>> reasons I would prefer not to reboot now.
>>
>> if that's all you have then it was false alarm. there should be a 'netdev timeout -
>> link reset' following those messages. can you send some more context on those
>> messages?
>
> All I presently know is that there are 20 servers and login doesn't work any more -
> sysrq+t does show me it hangs in fuse, which is accessing the underlying nfs (we
> are using unionfs-fuse). While I checked the sysrq-t output suddenly these e1000
> messages appeared.
> Thinking a bit about it, it either could be 2.6.22.18 has an e1000 bug, which
> 2.6.22.X didn't have (X=16, I think, but I'm not sure) or someone mis-configured
> the switch/network environment today.
> Hmm, now that I think about the last part, there already had been other
> networking problems today, which were supposed to be fixed several hours ago.
> Seems they didn't fix it properly.
>
>> in real tx hang cases, the hardware is reset within 2 seconds, and everything
>> continues as normal.
>
> Thanks, this gives me hope I don't need to reboot the servers (reboot would mean I
> would need to start 60 md-raid rebuilds...).

my first thought after I read this e-mail is that the tx-hang message is just a
symptom of your system not responding or being spinlocked all the time. These TX
hang issues normally do not interfere with system operation at all, and unless you
have continuous TX resets you would be able to log on perfectly fine.

I think you might have hit another kernel bug here... perhaps even unionfs/fuse
related and that certainly looks plausible from your problem description.

looking at the changelog for 2.6.22.16-2.6.22.18 I can't see anything relevant (see
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.22.y.git;a=shortlog), but
there are definitely no e1000 driver changes in that range anyway.

I don't suppose you can do a git-bisect? that would certainly help. I don't think we
can rule out anything just yet here. At least try to revert some of your systems to
the previous kernel version and see if the problem goes away...

Auke
--
To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
e1000: Detected Tx Unit Hang
Hello,

I can't login to one of our servers and just got this in an ipmi sol session:

[18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[18169.209183]   Tx Queue             <0>
[18169.209184]   TDH                  <e3>
[18169.209185]   TDT                  <e3>
[18169.209186]   next_to_use          <e3>
[18169.209187]   next_to_clean        <bd>
[18169.209188] buffer_info[next_to_clean]
[18169.209189]   time_stamp           <10043e4d2>
[18169.209190]   next_to_watch        <be>
[18169.209191]   jiffies              <10043e6f6>
[18169.209192]   next_to_watch.status <1>
[18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
[18169.256979]   Tx Queue             <0>
[18169.256980]   TDH                  <de>
[18169.256982]   TDT                  <de>
[18169.256983]   next_to_use          <de>
[18169.256984]   next_to_clean        <bc>
[18169.256985] buffer_info[next_to_clean]
[18169.256986]   time_stamp           <10043e511>
[18169.256987]   next_to_watch        <bd>
[18169.256988]   jiffies              <10043e701>
[18169.256989]   next_to_watch.status <1>

This is with 2.6.22.18. Is there any chance to recover the system? For some
reasons I would prefer not to reboot now.

Thanks,
Bernd
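A note for reading the dump: time_stamp and jiffies are both tick counts printed in hex, so their difference shows how long the oldest pending descriptor had been waiting when the message was printed. The following is a minimal sketch of that arithmetic, not driver code. The helper names are hypothetical, and the tick rate (HZ) of the reporter's kernel is not given in the thread, so any HZ passed to the conversion helper is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse one hex field from the dump (e.g. "10043e4d2") into a tick count. */
static uint64_t parse_hex_ticks(const char *field)
{
    char *end = NULL;
    uint64_t value = strtoull(field, &end, 16);

    assert(end != field && *end == '\0');   /* the whole field must be hex */
    return value;
}

/* Ticks elapsed between the descriptor's time_stamp and the jiffies value. */
static uint64_t ticks_waiting(const char *time_stamp, const char *jiffies)
{
    return parse_hex_ticks(jiffies) - parse_hex_ticks(time_stamp);
}

/* Convert ticks to milliseconds for an ASSUMED tick rate (HZ is not
 * stated anywhere in the report). */
static uint64_t ticks_to_ms(uint64_t ticks, unsigned int hz)
{
    return ticks * 1000u / hz;
}
```

For eth0, ticks_waiting("10043e4d2", "10043e6f6") gives 0x224 = 548 ticks, and for eth2 the difference is 0x1f0 = 496 ticks. That would be roughly 2.2 s and 2.0 s if the kernel was built with HZ=250, or about half a second with HZ=1000; which one applies depends on the kernel configuration.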
Re: e1000: Detected Tx Unit Hang
Bernd Schubert wrote:
> Hello,
>
> I can't login to one of our servers and just got this in an ipmi sol session:
>
> [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> [18169.209183]   Tx Queue             <0>
> [18169.209184]   TDH                  <e3>
> [18169.209185]   TDT                  <e3>
> [18169.209186]   next_to_use          <e3>
> [18169.209187]   next_to_clean        <bd>
> [18169.209188] buffer_info[next_to_clean]
> [18169.209189]   time_stamp           <10043e4d2>
> [18169.209190]   next_to_watch        <be>
> [18169.209191]   jiffies              <10043e6f6>
> [18169.209192]   next_to_watch.status <1>
> [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
> [18169.256979]   Tx Queue             <0>
> [18169.256980]   TDH                  <de>
> [18169.256982]   TDT                  <de>
> [18169.256983]   next_to_use          <de>
> [18169.256984]   next_to_clean        <bc>
> [18169.256985] buffer_info[next_to_clean]
> [18169.256986]   time_stamp           <10043e511>
> [18169.256987]   next_to_watch        <bd>
> [18169.256988]   jiffies              <10043e701>
> [18169.256989]   next_to_watch.status <1>
>
> This is with 2.6.22.18. Is there any chance to recover the system? For some
> reasons I would prefer not to reboot now.

if that's all you have then it was false alarm. there should be a 'netdev timeout -
link reset' following those messages. can you send some more context on those
messages?

in real tx hang cases, the hardware is reset within 2 seconds, and everything
continues as normal.

Auke
Re: e1000: Detected Tx Unit Hang
On Saturday 16 February 2008, Kok, Auke wrote:
> Bernd Schubert wrote:
>> Hello,
>>
>> I can't login to one of our servers and just got this in an ipmi sol session:
>>
>> [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>> [18169.209183]   Tx Queue             <0>
>> [18169.209184]   TDH                  <e3>
>> [18169.209185]   TDT                  <e3>
>> [18169.209186]   next_to_use          <e3>
>> [18169.209187]   next_to_clean        <bd>
>> [18169.209188] buffer_info[next_to_clean]
>> [18169.209189]   time_stamp           <10043e4d2>
>> [18169.209190]   next_to_watch        <be>
>> [18169.209191]   jiffies              <10043e6f6>
>> [18169.209192]   next_to_watch.status <1>
>> [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
>> [18169.256979]   Tx Queue             <0>
>> [18169.256980]   TDH                  <de>
>> [18169.256982]   TDT                  <de>
>> [18169.256983]   next_to_use          <de>
>> [18169.256984]   next_to_clean        <bc>
>> [18169.256985] buffer_info[next_to_clean]
>> [18169.256986]   time_stamp           <10043e511>
>> [18169.256987]   next_to_watch        <bd>
>> [18169.256988]   jiffies              <10043e701>
>> [18169.256989]   next_to_watch.status <1>
>>
>> This is with 2.6.22.18. Is there any chance to recover the system? For some
>> reasons I would prefer not to reboot now.
>
> if that's all you have then it was false alarm. there should be a 'netdev timeout -
> link reset' following those messages. can you send some more context on those
> messages?

All I presently know is that there are 20 servers and login doesn't work any more -
sysrq+t does show me it hangs in fuse, which is accessing the underlying nfs (we
are using unionfs-fuse). While I checked the sysrq-t output suddenly these e1000
messages appeared.
Thinking a bit about it, it either could be 2.6.22.18 has an e1000 bug, which
2.6.22.X didn't have (X=16, I think, but I'm not sure) or someone mis-configured
the switch/network environment today.
Hmm, now that I think about the last part, there already had been other
networking problems today, which were supposed to be fixed several hours ago.
Seems they didn't fix it properly.

> in real tx hang cases, the hardware is reset within 2 seconds, and everything
> continues as normal.

Thanks, this gives me hope I don't need to reboot the servers (reboot would mean I
would need to start 60 md-raid rebuilds...).

Thanks,
Bernd
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Robert Olsson [EMAIL PROTECTED]
Date: Mon, 21 Jan 2008 14:27:13 +0100

> Yes it works. e1000 tested for ~3 hours with very high load and
> interface up/down every 5th second. Without the patch the IRQs get
> disabled within a couple of seconds.
>
> A resolute way of handling the semaphores. :)
>
> Signed-off-by: Robert Olsson [EMAIL PROTECTED]

Thanks for testing, Robert.

I sent off that fix to Linus an hour or so ago, hopefully he will pick it up some time today.
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
David Miller writes:

> Yes, this semaphore thing is highly problematic. In the most crucial
> areas where network driver consistency matters the most for ease of
> understanding and debugging, the Intel drivers choose to be different
> :-(
>
> The way the napi_disable() logic breaks out from high packet load in
> net_rx_action() is it simply returns even leaving interrupts disabled
> when a napi_disable() is pending.
>
> This is what trips up the semaphore logic.
>
> Robert, give this patch a try.

Yes it works. e1000 tested for ~3 hours with very high load and
interface up/down every 5th second. Without the patch the IRQs get
disabled within a couple of seconds.

A resolute way of handling the semaphores. :)

Signed-off-by: Robert Olsson [EMAIL PROTECTED]

Cheers
					--ro

> In the long term this semaphore should be completely eliminated,
> there is no justification for it.
>
> Signed-off-by: David S. Miller [EMAIL PROTECTED]
>
> diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
> index 0c9a6f7..76c0fa6 100644
> --- a/drivers/net/e1000/e1000_main.c
> +++ b/drivers/net/e1000/e1000_main.c
> @@ -632,6 +632,7 @@ e1000_down(struct e1000_adapter *adapter)
>  #ifdef CONFIG_E1000_NAPI
>  	napi_disable(&adapter->napi);
> +	atomic_set(&adapter->irq_sem, 0);
>  #endif
>  	e1000_irq_disable(adapter);
> diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
> index 2ab3bfb..9cc5a6b 100644
> --- a/drivers/net/e1000e/netdev.c
> +++ b/drivers/net/e1000e/netdev.c
> @@ -2183,6 +2183,7 @@ void e1000e_down(struct e1000_adapter *adapter)
>  	msleep(10);
>  	napi_disable(&adapter->napi);
> +	atomic_set(&adapter->irq_sem, 0);
>  	e1000_irq_disable(adapter);
>  	del_timer_sync(&adapter->watchdog_timer);
> diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
> index d2fb88d..4f63839 100644
> --- a/drivers/net/ixgb/ixgb_main.c
> +++ b/drivers/net/ixgb/ixgb_main.c
> @@ -296,6 +296,11 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t kill_watchdog)
>  {
>  	struct net_device *netdev = adapter->netdev;
> +#ifdef CONFIG_IXGB_NAPI
> +	napi_disable(&adapter->napi);
> +	atomic_set(&adapter->irq_sem, 0);
> +#endif
> +
>  	ixgb_irq_disable(adapter);
>  	free_irq(adapter->pdev->irq, netdev);
> @@ -304,9 +309,7 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t kill_watchdog)
>  	if(kill_watchdog)
>  		del_timer_sync(&adapter->watchdog_timer);
> -#ifdef CONFIG_IXGB_NAPI
> -	napi_disable(&adapter->napi);
> -#endif
> +
>  	adapter->link_speed = 0;
>  	adapter->link_duplex = 0;
>  	netif_carrier_off(netdev);
> diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
> index de3f45e..a4265bc 100644
> --- a/drivers/net/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ixgbe/ixgbe_main.c
> @@ -1409,9 +1409,11 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
>  	IXGBE_WRITE_FLUSH(&adapter->hw);
>  	msleep(10);
> +	napi_disable(&adapter->napi);
> +	atomic_set(&adapter->irq_sem, 0);
> +
>  	ixgbe_irq_disable(adapter);
> -	napi_disable(&adapter->napi);
>  	del_timer_sync(&adapter->watchdog_timer);
>  	netif_carrier_off(netdev);
RE: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
David Miller wrote:
> From: Robert Olsson [EMAIL PROTECTED]
> Date: Fri, 18 Jan 2008 14:00:57 +0100
>
>> I don't understand the idea with semaphore for enabling/disabling
>> irq's either; the overall logic must be safer/better without it.
>
> They must have had code paths where they didn't know if IRQs were
> enabled or not already, so they tried to create something which
> approximates the:
>
>	local_irq_save(flags);
>	local_irq_restore(flags);
>
> constructs we have for CPU interrupts, so they could go:
>
>	e1000_irq_disable();
>	/* ... */
>	e1000_irq_enable();
>
> and this would work even if the caller was running with e1000
> interrupts disabled already.
>
> Or, something like that... it is indeed confusing.
>
> Anyways, yes it's totally bogus and should be removed.

I agree, bogus, and in fact I've already removed it from our development
version of ixgbe.

Right now I wanted to report that I can't remove e1000 at all on
2.6.24-rc8+git.

I continually get the
kernel: unregister_netdevice: waiting for eth2 to become free. Usage count = 1

Where 2.6.24-rc5 e1000 is okay still. Seems like maybe we are still
missing a netif_rx_complete or a napi_disable somewhere.

I don't think this problem has anything to do with the irq_sem right
now. Something is still badly broken. I am just using the interface
regularly (no heavy load) and I can't unload the module.

Jesse
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
On Sun, Jan 20, 2008 at 01:20:11AM -0800, Brandeburg, Jesse wrote:
> I continually get the
> kernel: unregister_netdevice: waiting for eth2 to become free. Usage count = 1

http://bugzilla.kernel.org/show_bug.cgi?id=9778

--
WBR, wRAR (ALT Linux Team)
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
Hello.

It works, thanks for resending it!
Sorry, I understood that patch 53e52c729cc169db82a6105fac7a166e10c2ec36 ([NET]: Make ->poll() breakout consistent in Intel ethernet drivers.) had a regression and rolled it back; I did not see your patch. Sorry again.
Thanks!

> From: Badalian Vyacheslav [EMAIL PROTECTED]
> Date: Wed, 16 Jan 2008 12:02:28 +0300
>
>> Also have regression after apply patch.
>
> BTW, if you are using the e1000e driver then this initial
> patch will not work.
>
> My more recent patch posting for this problem, will.
> I include it again below for you:
>
> [NET]: Fix TX timeout regression in Intel drivers.
>
> This fixes a regression added by changeset
> 53e52c729cc169db82a6105fac7a166e10c2ec36 ([NET]: Make ->poll()
> breakout consistent in Intel ethernet drivers.)
>
> As pointed out by Jesse Brandeburg, for three of the drivers edited
> above there is breakout logic in the *_clean_tx_irq() code to prevent
> running TX reclaim forever. If this occurs, we have to elide NAPI
> poll completion or else those TX events will never be serviced.
>
> Signed-off-by: David S. Miller [EMAIL PROTECTED]
>
> diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
> index 13d57b0..0c9a6f7 100644
> --- a/drivers/net/e1000/e1000_main.c
> +++ b/drivers/net/e1000/e1000_main.c
> @@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
>  {
>  	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
>  	struct net_device *poll_dev = adapter->netdev;
> -	int work_done = 0;
> +	int tx_cleaned = 0, work_done = 0;
>  	/* Must NOT use netdev_priv macro here. */
>  	adapter = poll_dev->priv;
> @@ -3929,14 +3929,17 @@ e1000_clean(struct napi_struct *napi, int budget)
>  	 * simultaneously. A failure obtaining the lock means
>  	 * tx_ring[0] is currently being cleaned anyway. */
>  	if (spin_trylock(&adapter->tx_queue_lock)) {
> -		e1000_clean_tx_irq(adapter,
> -				   &adapter->tx_ring[0]);
> +		tx_cleaned = e1000_clean_tx_irq(adapter,
> +						&adapter->tx_ring[0]);
>  		spin_unlock(&adapter->tx_queue_lock);
>  	}
>  	adapter->clean_rx(adapter, &adapter->rx_ring[0], &work_done, budget);
> +	if (tx_cleaned)
> +		work_done = budget;
> +
>  	/* If budget not fully consumed, exit the polling mode */
>  	if (work_done < budget) {
>  		if (likely(adapter->itr_setting & 3))
> diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
> index 4a6fc74..2ab3bfb 100644
> --- a/drivers/net/e1000e/netdev.c
> +++ b/drivers/net/e1000e/netdev.c
> @@ -1384,7 +1384,7 @@ static int e1000_clean(struct napi_struct *napi, int budget)
>  {
>  	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
>  	struct net_device *poll_dev = adapter->netdev;
> -	int work_done = 0;
> +	int tx_cleaned = 0, work_done = 0;
>  	/* Must NOT use netdev_priv macro here. */
>  	adapter = poll_dev->priv;
> @@ -1394,12 +1394,15 @@ static int e1000_clean(struct napi_struct *napi, int budget)
>  	 * simultaneously. A failure obtaining the lock means
>  	 * tx_ring is currently being cleaned anyway. */
>  	if (spin_trylock(&adapter->tx_queue_lock)) {
> -		e1000_clean_tx_irq(adapter);
> +		tx_cleaned = e1000_clean_tx_irq(adapter);
>  		spin_unlock(&adapter->tx_queue_lock);
>  	}
>  	adapter->clean_rx(adapter, &work_done, budget);
> +	if (tx_cleaned)
> +		work_done = budget;
> +
>  	/* If budget not fully consumed, exit the polling mode */
>  	if (work_done < budget) {
>  		if (adapter->itr_setting & 3)
> diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
> index a564916..de3f45e 100644
> --- a/drivers/net/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ixgbe/ixgbe_main.c
> @@ -1468,13 +1468,16 @@ static int ixgbe_clean(struct napi_struct *napi, int budget)
>  	struct ixgbe_adapter *adapter = container_of(napi, struct ixgbe_adapter, napi);
>  	struct net_device *netdev = adapter->netdev;
> -	int work_done = 0;
> +	int tx_cleaned = 0, work_done = 0;
>  	/* In non-MSIX case, there is no multi-Tx/Rx queue */
> -	ixgbe_clean_tx_irq(adapter, adapter->tx_ring);
> +	tx_cleaned = ixgbe_clean_tx_irq(adapter, adapter->tx_ring);
>  	ixgbe_clean_rx_irq(adapter, &adapter->rx_ring[0], &work_done, budget);
> +	if (tx_cleaned)
> +		work_done = budget;
> +
>  	/* If budget not fully consumed, exit the polling mode */
>  	if (work_done < budget) {
>  		netif_rx_complete(netdev, napi);
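The decision the quoted patch adds to each poll routine can be shown in isolation. The following is a userspace sketch under stated assumptions, not the driver code: poll_sketch and struct poll_result are hypothetical names, rx_done stands for what the RX clean routine reports through work_done, and tx_cleaned stands for whatever the driver's *_clean_tx_irq() returned.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type for this sketch only. */
struct poll_result {
    int  work_done;   /* value the poll routine would report back */
    bool completed;   /* true if the driver would leave polling mode */
};

/*
 * rx_done:    packets handled by the RX clean routine (0..budget)
 * tx_cleaned: return value of the TX clean routine; nonzero is treated
 *             as "TX may still need servicing"
 */
static struct poll_result poll_sketch(int rx_done, int tx_cleaned, int budget)
{
    struct poll_result r = { .work_done = rx_done, .completed = false };

    assert(rx_done >= 0 && rx_done <= budget);

    if (tx_cleaned)
        r.work_done = budget;   /* report a full budget: stay in polling mode */

    /* If budget not fully consumed, exit the polling mode.  In the
     * drivers this is where netif_rx_complete() is called. */
    if (r.work_done < budget)
        r.completed = true;

    return r;
}
```

With a budget of 64, poll_sketch(3, 0, 64) completes polling and reports 3, while poll_sketch(3, 1, 64) reports 64 and stays in polling mode, so the outstanding TX work gets another pass.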
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Robert Olsson [EMAIL PROTECTED]
Date: Wed, 16 Jan 2008 18:07:38 +0100

> eth0 e1000_irq_enable sem = 1    <- High netload
> eth0 e1000_irq_enable sem = 1
> eth0 e1000_irq_enable sem = 1
> eth0 e1000_irq_enable sem = 1
> eth0 e1000_irq_enable sem = 1
> eth0 e1000_irq_enable sem = 1
> eth0 e1000_irq_enable sem = 1    <- ifconfig eth0 down
> eth0 e1000_irq_disable sem = 2
>
> **e1000_open                     <- ifconfig eth0 up
> eth0 e1000_irq_disable sem = 3  Dead. irq's can't be enabled
> e1000_irq_enable miss
> eth0 e1000_irq_enable sem = 2
> e1000_irq_enable miss
> eth0 e1000_irq_enable sem = 1
> ADDRCONF(NETDEV_UP): eth0: link is not ready

Yes, this semaphore thing is highly problematic. In the most crucial
areas where network driver consistency matters the most for ease of
understanding and debugging, the Intel drivers choose to be different
:-(

The way the napi_disable() logic breaks out from high packet load in
net_rx_action() is it simply returns even leaving interrupts disabled
when a napi_disable() is pending.

This is what trips up the semaphore logic.

Robert, give this patch a try.

In the long term this semaphore should be completely eliminated,
there is no justification for it.

Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 0c9a6f7..76c0fa6 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -632,6 +632,7 @@ e1000_down(struct e1000_adapter *adapter)
 #ifdef CONFIG_E1000_NAPI
 	napi_disable(&adapter->napi);
+	atomic_set(&adapter->irq_sem, 0);
 #endif
 	e1000_irq_disable(adapter);
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 2ab3bfb..9cc5a6b 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -2183,6 +2183,7 @@ void e1000e_down(struct e1000_adapter *adapter)
 	msleep(10);
 	napi_disable(&adapter->napi);
+	atomic_set(&adapter->irq_sem, 0);
 	e1000_irq_disable(adapter);
 	del_timer_sync(&adapter->watchdog_timer);
diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index d2fb88d..4f63839 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -296,6 +296,11 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t kill_watchdog)
 {
 	struct net_device *netdev = adapter->netdev;
+#ifdef CONFIG_IXGB_NAPI
+	napi_disable(&adapter->napi);
+	atomic_set(&adapter->irq_sem, 0);
+#endif
+
 	ixgb_irq_disable(adapter);
 	free_irq(adapter->pdev->irq, netdev);
@@ -304,9 +309,7 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t kill_watchdog)
 	if(kill_watchdog)
 		del_timer_sync(&adapter->watchdog_timer);
-#ifdef CONFIG_IXGB_NAPI
-	napi_disable(&adapter->napi);
-#endif
+
 	adapter->link_speed = 0;
 	adapter->link_duplex = 0;
 	netif_carrier_off(netdev);
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index de3f45e..a4265bc 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1409,9 +1409,11 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 	IXGBE_WRITE_FLUSH(&adapter->hw);
 	msleep(10);
+	napi_disable(&adapter->napi);
+	atomic_set(&adapter->irq_sem, 0);
+
 	ixgbe_irq_disable(adapter);
-	napi_disable(&adapter->napi);
 	del_timer_sync(&adapter->watchdog_timer);
 	netif_carrier_off(netdev);
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Robert Olsson [EMAIL PROTECTED]
Date: Fri, 18 Jan 2008 14:00:57 +0100

> I don't understand the idea with semaphore for enabling/disabling
> irq's either; the overall logic must be safer/better without it.

They must have had code paths where they didn't know if IRQs were
enabled or not already, so they tried to create something which
approximates the:

	local_irq_save(flags);
	local_irq_restore(flags);

constructs we have for CPU interrupts, so they could go:

	e1000_irq_disable();
	/* ... */
	e1000_irq_enable();

and this would work even if the caller was running with e1000
interrupts disabled already.

Or, something like that... it is indeed confusing.

Anyways, yes it's totally bogus and should be removed.
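The nesting scheme described above, and the way one unbalanced disable breaks it, can be modelled in a few lines of userspace C. This is a simplified sketch under stated assumptions, not the driver's implementation: struct fake_adapter and the fake_* functions are hypothetical, a plain int stands in for the atomic irq_sem counter, and a bool stands in for the hardware interrupt mask.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the adapter state; not the driver's real struct. */
struct fake_adapter {
    int  irq_sem;        /* nesting count of outstanding disables */
    bool irqs_enabled;   /* stands in for the device interrupt mask */
};

/* Every disable bumps the count and masks interrupts. */
static void fake_irq_disable(struct fake_adapter *a)
{
    a->irq_sem++;
    a->irqs_enabled = false;
}

/* An enable only unmasks when it balances the last outstanding disable. */
static void fake_irq_enable(struct fake_adapter *a)
{
    assert(a->irq_sem > 0);
    if (--a->irq_sem == 0)
        a->irqs_enabled = true;
}

/* Models the down path with the reset from the irq_sem patch in this
 * thread: forget any unbalanced disables, then disable once. */
static void fake_down(struct fake_adapter *a)
{
    a->irq_sem = 0;
    fake_irq_disable(a);
}
```

If a poll pass disables interrupts and returns without the matching enable, a later disable/enable pair leaves the count at 1 and the interrupts masked for good. Going through fake_down() instead resets the count first, so the next enable brings it back to 0 and unmasks.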
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
David Miller writes:

>> eth0 e1000_irq_enable sem = 1    <- ifconfig eth0 down
>> eth0 e1000_irq_disable sem = 2
>>
>> **e1000_open                     <- ifconfig eth0 up
>> eth0 e1000_irq_disable sem = 3  Dead. irq's can't be enabled
>> e1000_irq_enable miss
>> eth0 e1000_irq_enable sem = 2
>> e1000_irq_enable miss
>> eth0 e1000_irq_enable sem = 1
>> ADDRCONF(NETDEV_UP): eth0: link is not ready
>
> Yes, this semaphore thing is highly problematic. In the most crucial
> areas where network driver consistency matters the most for ease of
> understanding and debugging, the Intel drivers choose to be different

I don't understand the idea with semaphore for enabling/disabling
irq's either; the overall logic must be safer/better without it.

> The way the napi_disable() logic breaks out from high packet load in
> net_rx_action() is it simply returns even leaving interrupts disabled
> when a napi_disable() is pending.
>
> This is what trips up the semaphore logic.
>
> Robert, give this patch a try.
>
> In the long term this semaphore should be completely eliminated,
> there is no justification for it.

It's on the testing list...

Cheers
					--ro

> Signed-off-by: David S. Miller [EMAIL PROTECTED]
>
> diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
> index 0c9a6f7..76c0fa6 100644
> --- a/drivers/net/e1000/e1000_main.c
> +++ b/drivers/net/e1000/e1000_main.c
> @@ -632,6 +632,7 @@ e1000_down(struct e1000_adapter *adapter)
>  #ifdef CONFIG_E1000_NAPI
>  	napi_disable(&adapter->napi);
> +	atomic_set(&adapter->irq_sem, 0);
>  #endif
>  	e1000_irq_disable(adapter);
> diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
> index 2ab3bfb..9cc5a6b 100644
> --- a/drivers/net/e1000e/netdev.c
> +++ b/drivers/net/e1000e/netdev.c
> @@ -2183,6 +2183,7 @@ void e1000e_down(struct e1000_adapter *adapter)
>  	msleep(10);
>  	napi_disable(&adapter->napi);
> +	atomic_set(&adapter->irq_sem, 0);
>  	e1000_irq_disable(adapter);
>  	del_timer_sync(&adapter->watchdog_timer);
> diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
> index d2fb88d..4f63839 100644
> --- a/drivers/net/ixgb/ixgb_main.c
> +++ b/drivers/net/ixgb/ixgb_main.c
> @@ -296,6 +296,11 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t kill_watchdog)
>  {
>  	struct net_device *netdev = adapter->netdev;
> +#ifdef CONFIG_IXGB_NAPI
> +	napi_disable(&adapter->napi);
> +	atomic_set(&adapter->irq_sem, 0);
> +#endif
> +
>  	ixgb_irq_disable(adapter);
>  	free_irq(adapter->pdev->irq, netdev);
> @@ -304,9 +309,7 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t kill_watchdog)
>  	if(kill_watchdog)
>  		del_timer_sync(&adapter->watchdog_timer);
> -#ifdef CONFIG_IXGB_NAPI
> -	napi_disable(&adapter->napi);
> -#endif
> +
>  	adapter->link_speed = 0;
>  	adapter->link_duplex = 0;
>  	netif_carrier_off(netdev);
> diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
> index de3f45e..a4265bc 100644
> --- a/drivers/net/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ixgbe/ixgbe_main.c
> @@ -1409,9 +1409,11 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
>  	IXGBE_WRITE_FLUSH(&adapter->hw);
>  	msleep(10);
> +	napi_disable(&adapter->napi);
> +	atomic_set(&adapter->irq_sem, 0);
> +
>  	ixgbe_irq_disable(adapter);
> -	napi_disable(&adapter->napi);
>  	del_timer_sync(&adapter->watchdog_timer);
>  	netif_carrier_off(netdev);
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Frans Pop [EMAIL PROTECTED]
Date: Thu, 17 Jan 2008 08:51:55 +0100

> On Thursday 17 January 2008, David Miller wrote:
>> From: Brandeburg, Jesse [EMAIL PROTECTED]
>>> We spent Wednesday trying to reproduce (without the patch) these
>>> issues without much luck, and have applied the patch cleanly and
>>> will continue testing it. Given the simplicity of the changes, and
>>> the community testing, I'll give my ack and we will continue testing.
>>
>> You need a slow CPU, and you need to make sure you do actually
>> trigger the TX limiting code there.
>
> Hmmm. Is a dual core Pentium D 3.20GHz considered slow these days?

No of course :-) I guess it therefore depends upon the load
as well.
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
On Thu, Jan 17, 2008 at 12:00:02AM -0800, David Miller wrote:
> From: Frans Pop [EMAIL PROTECTED]
> Date: Thu, 17 Jan 2008 08:51:55 +0100
>
>> On Thursday 17 January 2008, David Miller wrote:
>>> From: Brandeburg, Jesse [EMAIL PROTECTED]
>>>> We spent Wednesday trying to reproduce (without the patch) these
>>>> issues without much luck, and have applied the patch cleanly and
>>>> will continue testing it. Given the simplicity of the changes, and
>>>> the community testing, I'll give my ack and we will continue testing.
>>>
>>> You need a slow CPU, and you need to make sure you do actually
>>> trigger the TX limiting code there.
>>
>> Hmmm. Is a dual core Pentium D 3.20GHz considered slow these days?
>
> No of course :-) I guess it therefore depends upon the load
> as well.

I saw it just once, yesterday:

[EMAIL PROTECTED] ~]# uname -r
2.6.24-rc5

e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <58>
  TDT                  <8f>
  next_to_use          <8f>
  next_to_clean        <55>
buffer_info[next_to_clean]
  time_stamp           <105e973a9>
  next_to_watch        <56>
  jiffies              <105e97992>
  next_to_watch.status <1>
[EMAIL PROTECTED] ~]#

This was on a Lenovo T60W, a Core 2 Duo machine (2GHz), while using it to stress
test another machine. I was using netperf TCP_STREAM ranging from 1 to 8 streams
+ a ping -f using various packet sizes.

I'll update this machine today to 2.6.24-rc8-git + net-2.6 and try again
to reproduce.

I also applied David's patch while trying some RT experiments on another,
8 way machine used as a server, but on this machine I didn't experience
the Tx Unit Hang message with or without the patch.

- Arnaldo
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
Date: Thu, 17 Jan 2008 07:40:07 -0200

> I'll update this machine today to 2.6.24-rc8-git + net-2.6 and try again
> to reproduce.

Thanks for the datapoints and testing.
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
On Wednesday 16 January 2008, David Miller wrote:
> Ok, here is the patch I'll propose to fix this. The goal is to make
> it as simple as possible without regressing the thing we were trying
> to fix.

Looks good to me. Tested with -rc8.

Cheers,
FJP
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
Applied to 2.6.24-rc7-git2.

Have messages

I also see a regression after applying the patch.
Before the patch the system could handle above 800mbs of traffic. After that, does it exit polling mode? (4 CPUs: one CPU at 100% si in process ksoftirqd/0, three CPUs idle.)
After the patch the system exits polling mode at above 600mbs.

Thanks.

> From: Frans Pop [EMAIL PROTECTED]
> Date: Tue, 15 Jan 2008 06:25:10 +0100
>
>> kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>
> Does this make the problem go away?
>
> (Note this isn't the final correct patch we should apply. There
> is no reason why this revert back to the older ->poll() logic here
> should have any effect on the TX hang triggering...)
>
> diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
> index 13d57b0..cada32c 100644
> --- a/drivers/net/e1000/e1000_main.c
> +++ b/drivers/net/e1000/e1000_main.c
> @@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
>  {
>  	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
>  	struct net_device *poll_dev = adapter->netdev;
> -	int work_done = 0;
> +	int tx_work = 0, work_done = 0;
>  	/* Must NOT use netdev_priv macro here. */
>  	adapter = poll_dev->priv;
> @@ -3929,8 +3929,8 @@ e1000_clean(struct napi_struct *napi, int budget)
>  	 * simultaneously. A failure obtaining the lock means
>  	 * tx_ring[0] is currently being cleaned anyway. */
>  	if (spin_trylock(&adapter->tx_queue_lock)) {
> -		e1000_clean_tx_irq(adapter,
> -				   &adapter->tx_ring[0]);
> +		tx_work = e1000_clean_tx_irq(adapter,
> +					     &adapter->tx_ring[0]);
>  		spin_unlock(&adapter->tx_queue_lock);
>  	}
> @@ -3938,7 +3938,7 @@ e1000_clean(struct napi_struct *napi, int budget)
>  			  &work_done, budget);
>  	/* If budget not fully consumed, exit the polling mode */
> -	if (work_done < budget) {
> +	if (!tx_work && (work_done < budget)) {
>  		if (likely(adapter->itr_setting & 3))
>  			e1000_set_itr(adapter);
>  		netif_rx_complete(poll_dev, napi);
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Frans Pop [EMAIL PROTECTED]
Date: Wed, 16 Jan 2008 09:56:08 +0100

> On Wednesday 16 January 2008, David Miller wrote:
> > Ok, here is the patch I'll propose to fix this.  The goal is to make
> > it as simple as possible without regressing the thing we were trying
> > to fix.
>
> Looks good to me. Tested with -rc8.

Thanks for testing.
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Badalian Vyacheslav [EMAIL PROTECTED]
Date: Wed, 16 Jan 2008 12:02:28 +0300

> Also have regression after apply patch.

BTW, if you are using the e1000e driver then this initial patch
will not work.

My more recent patch posting for this problem, will.  I include
it again below for you:

[NET]: Fix TX timeout regression in Intel drivers.

This fixes a regression added by changeset
53e52c729cc169db82a6105fac7a166e10c2ec36 ("[NET]: Make ->poll()
breakout consistent in Intel ethernet drivers.")

As pointed out by Jesse Brandeburg, for three of the drivers edited
above there is breakout logic in the *_clean_tx_irq() code to prevent
running TX reclaim forever.  If this occurs, we have to elide NAPI
poll completion or else those TX events will never be serviced.

Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 13d57b0..0c9a6f7 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -3929,14 +3929,17 @@ e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring[0] is currently being cleaned anyway. */
 	if (spin_trylock(&adapter->tx_queue_lock)) {
-		e1000_clean_tx_irq(adapter,
-		                   &adapter->tx_ring[0]);
+		tx_cleaned = e1000_clean_tx_irq(adapter,
+						&adapter->tx_ring[0]);
 		spin_unlock(&adapter->tx_queue_lock);
 	}
 
 	adapter->clean_rx(adapter, &adapter->rx_ring[0],
 	                  &work_done, budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		if (likely(adapter->itr_setting & 3))
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 4a6fc74..2ab3bfb 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1384,7 +1384,7 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -1394,12 +1394,15 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring is currently being cleaned anyway. */
 	if (spin_trylock(&adapter->tx_queue_lock)) {
-		e1000_clean_tx_irq(adapter);
+		tx_cleaned = e1000_clean_tx_irq(adapter);
 		spin_unlock(&adapter->tx_queue_lock);
 	}
 
 	adapter->clean_rx(adapter, &work_done, budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		if (adapter->itr_setting & 3)
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index a564916..de3f45e 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1468,13 +1468,16 @@ static int ixgbe_clean(struct napi_struct *napi, int budget)
 	struct ixgbe_adapter *adapter = container_of(napi,
 					struct ixgbe_adapter, napi);
 	struct net_device *netdev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* In non-MSIX case, there is no multi-Tx/Rx queue */
-	ixgbe_clean_tx_irq(adapter, adapter->tx_ring);
+	tx_cleaned = ixgbe_clean_tx_irq(adapter, adapter->tx_ring);
 	ixgbe_clean_rx_irq(adapter, &adapter->rx_ring[0], &work_done,
 			   budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		netif_rx_complete(netdev, napi);
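[Editorial illustration, not part of the original posting.] The idea in the patch above can be modelled in user space. This is a minimal sketch under stated assumptions: `struct fake_dev`, `fake_clean_tx()`, `fake_clean_rx()`, `fake_poll()` and `FAKE_TX_LIMIT` are hypothetical names invented here, and clearing the `polling` flag only stands in for what netif_rx_complete() does in the driver. It shows that reporting `work_done = budget` whenever TX reclaim made progress keeps the device in polling mode until the TX backlog is gone.

```c
#include <stdbool.h>

#define FAKE_TX_LIMIT 64	/* illustrative per-call TX reclaim cap (assumption) */

/* Hypothetical device state, not the driver's real structures. */
struct fake_dev {
	unsigned int tx_pending;	/* TX descriptors waiting for reclaim */
	unsigned int rx_pending;	/* received packets waiting */
	bool polling;			/* true while the poll stays scheduled */
};

/* Reclaim at most FAKE_TX_LIMIT descriptors; true if any were reclaimed. */
static bool fake_clean_tx(struct fake_dev *dev)
{
	unsigned int n = dev->tx_pending < FAKE_TX_LIMIT ?
			 dev->tx_pending : FAKE_TX_LIMIT;

	dev->tx_pending -= n;
	return n > 0;
}

/* Process received packets until the budget is used up. */
static void fake_clean_rx(struct fake_dev *dev, int *work_done, int budget)
{
	while (dev->rx_pending > 0 && *work_done < budget) {
		dev->rx_pending--;
		(*work_done)++;
	}
}

static int fake_poll(struct fake_dev *dev, int budget)
{
	int work_done = 0;
	bool tx_cleaned = fake_clean_tx(dev);

	fake_clean_rx(dev, &work_done, budget);

	/* The fix: any TX progress counts as a fully consumed budget. */
	if (tx_cleaned)
		work_done = budget;

	/* Only leave polling mode when the budget was not consumed. */
	if (work_done < budget)
		dev->polling = false;	/* stands in for netif_rx_complete() */

	return work_done;
}
```

Starting from 100 pending TX descriptors and no RX traffic, two calls to `fake_poll(&dev, 64)` return 64 and keep `polling` set; only the third call, which finds nothing left to reclaim, returns 0 and leaves polling mode.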
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Badalian Vyacheslav [EMAIL PROTECTED]
Date: Wed, 16 Jan 2008 12:02:28 +0300

> applied to 2.6.24-rc7-git2
>
> Have messages
>
> Also have regression after apply patch.
> System may do above 800mbs traffic before patch. After its exit polling
> mode? (4 CPU, 1 cpu get 100% si (process ksoftirqd/0), 3 CPU is IDLE)
> After patch system was go to exit polling mode at above 600mbs.

What do you mean by 'system was go to exit polling mode'?

Please be more clear about your situation, in particular provide
every detail about what happens so that we can properly debug
this.

Thanks.
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
David Miller writes:
> > On Wednesday 16 January 2008, David Miller wrote:
> > > Ok, here is the patch I'll propose to fix this.  The goal is to make
> > > it as simple as possible without regressing the thing we were trying
> > > to fix.
> >
> > Looks good to me. Tested with -rc8.
>
> Thanks for testing.

Yes that code looks nice. I'm using the patch but I've noticed another
phenomenon with the current e1000 driver. There is a race when taking a
device down at high traffic loads. I've tracked and instrumented it, and it
seems that occasionally irq_sem can get bumped up so interrupts can't be
enabled again.

eth0 e1000_irq_enable sem = 1     <- High netload
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1     <- ifconfig eth0 down
eth0 e1000_irq_disable sem = 2

**e1000_open                      <- ifconfig eth0 up
eth0 e1000_irq_disable sem = 3    Dead. irq's can't be enabled
e1000_irq_enable miss
eth0 e1000_irq_enable sem = 2
e1000_irq_enable miss
eth0 e1000_irq_enable sem = 1
ADDRCONF(NETDEV_UP): eth0: link is not ready

Cheers
					--ro

static void
e1000_irq_disable(struct e1000_adapter *adapter)
{
	atomic_inc(&adapter->irq_sem);
	E1000_WRITE_REG(&adapter->hw, IMC, ~0);
	E1000_WRITE_FLUSH(&adapter->hw);
	synchronize_irq(adapter->pdev->irq);

	if (adapter->netdev->ifindex == 3)
		printk("%s e1000_irq_disable sem = %d\n", adapter->netdev->name,
		       atomic_read(&adapter->irq_sem));
}

static void
e1000_irq_enable(struct e1000_adapter *adapter)
{
	if (likely(atomic_dec_and_test(&adapter->irq_sem))) {
		E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK);
		E1000_WRITE_FLUSH(&adapter->hw);
	} else
		printk("e1000_irq_enable miss\n");

	if (adapter->netdev->ifindex == 3)
		printk("%s e1000_irq_enable sem = %d\n", adapter->netdev->name,
		       atomic_read(&adapter->irq_sem));
}
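[Editorial illustration, not part of the original posting.] The counting scheme behind irq_sem can be shown with C11 atomics in user space. This is a rough sketch, assuming nothing about where the extra increment in the trace above comes from: `struct fake_adapter` and the `fake_*` functions are hypothetical, and the `irq_enabled` flag replaces the register writes. It only demonstrates that interrupts come back on when the counter returns to zero, so one disable with no matching enable leaves them off.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical adapter: a disable counter plus a flag replacing the hardware. */
struct fake_adapter {
	atomic_int irq_sem;	/* number of outstanding disables */
	bool irq_enabled;
};

static void fake_adapter_init(struct fake_adapter *a)
{
	atomic_init(&a->irq_sem, 0);
	a->irq_enabled = true;
}

static void fake_irq_disable(struct fake_adapter *a)
{
	atomic_fetch_add(&a->irq_sem, 1);
	a->irq_enabled = false;
}

/* Returns true if interrupts were really re-enabled, false on a "miss". */
static bool fake_irq_enable(struct fake_adapter *a)
{
	/* atomic_fetch_sub() returns the old value, so old == 1 means the
	 * counter just reached zero, the same condition that the kernel's
	 * atomic_dec_and_test() reports. */
	if (atomic_fetch_sub(&a->irq_sem, 1) == 1) {
		a->irq_enabled = true;
		return true;
	}
	return false;
}
```

After two calls to `fake_irq_disable()` a single `fake_irq_enable()` is a miss and `irq_enabled` stays false; a second enable is needed to bring the counter back to zero.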
RE: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
David Miller wrote:
> From: Brandeburg, Jesse [EMAIL PROTECTED]
> Date: Tue, 15 Jan 2008 13:53:43 -0800
>
> > The tx code has an early exit that tries to limit the amount of tx
> > packets handled in a single poll loop and requires napi or interrupt
> > rescheduling based on the return value from e1000_clean_tx_irq.
>
> That explains everything, thanks Jesse.
>
> Ok, here is the patch I'll propose to fix this.  The goal is to make
> it as simple as possible without regressing the thing we were trying
> to fix.

We spent Wednesday trying to reproduce (without the patch) these issues
without much luck, and have applied the patch cleanly and will continue
testing it.  Given the simplicity of the changes, and the community
testing, I'll give my ack and we will continue testing.

I think we should fix Robert's (unrelated, but in this thread) reported
issue before 2.6.24 final if we can, and I'll look at that tonight and
tomorrow.

Thanks for your work on this Dave,
  Jesse

Acked-by: Jesse Brandeburg [EMAIL PROTECTED]
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Brandeburg, Jesse [EMAIL PROTECTED]
Date: Wed, 16 Jan 2008 23:09:47 -0800

> We spent Wednesday trying to reproduce (without the patch) these issues
> without much luck, and have applied the patch cleanly and will continue
> testing it.  Given the simplicity of the changes, and the community
> testing, I'll give my ack and we will continue testing.

You need a slow CPU, and you need to make sure you do actually
trigger the TX limiting code there.

I bet your cpus are fast enough that it simply never triggers.
:-)

> Acked-by: Jesse Brandeburg [EMAIL PROTECTED]

Thanks for reviewing Jesse.
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
On Thursday 17 January 2008, David Miller wrote:
> From: Brandeburg, Jesse [EMAIL PROTECTED]
> > We spent Wednesday trying to reproduce (without the patch) these issues
> > without much luck, and have applied the patch cleanly and will continue
> > testing it.  Given the simplicity of the changes, and the community
> > testing, I'll give my ack and we will continue testing.
>
> You need a slow CPU, and you need to make sure you do actually
> trigger the TX limiting code there.

Hmmm. Is a dual core Pentium D 3.20GHz considered slow these days?
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
On Tuesday 15 January 2008, David Miller wrote:
> From: Frans Pop [EMAIL PROTECTED]
> > kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>
> Does this make the problem go away?

Yes, it very much looks like that solves it.
I ran with the patch for 6 hours or so without any errors. I then switched
back to an unpatched kernel and they reappeared immediately.

> (Note this isn't the final correct patch we should apply.  There
>  is no reason why this revert back to the older ->poll() logic
>  here should have any effect on the TX hang triggering...)

s/no reason/no obvious reason/ ? ;-)

Cheers,
FJP
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
Quoting Frans Pop [EMAIL PROTECTED]:

> On Tuesday 15 January 2008, David Miller wrote:
> > From: Frans Pop [EMAIL PROTECTED]
> > > kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> >
> > Does this make the problem go away?
>
> Yes, it very much looks like that solves it.
> I ran with the patch for 6 hours or so without any errors. I then switched
> back to an unpatched kernel and they reappeared immediately.
>
> > (Note this isn't the final correct patch we should apply.  There
> >  is no reason why this revert back to the older ->poll() logic
> >  here should have any effect on the TX hang triggering...)
>
> s/no reason/no obvious reason/ ? ;-)
>
> Cheers,
> FJP

Hello.
I also tried your patch (applied to 2.6.24-rc7-git2).
I caught this message in dmesg:

[ 1771.796954] e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
[ 1771.796957]   Tx Queue             0
[ 1771.796958]   TDH                  54
[ 1771.796959]   TDT                  54
[ 1771.796960]   next_to_use          54
[ 1771.796961]   next_to_clean        a9
[ 1771.796962] buffer_info[next_to_clean]
[ 1771.796963]   time_stamp           14d72e
[ 1771.796964]   next_to_watch        a9
[ 1771.796965]   jiffies              14ddd3
[ 1771.796966]   next_to_watch.status 1

Thanks.
RE: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
[EMAIL PROTECTED] wrote:
> Quoting Frans Pop [EMAIL PROTECTED]:
> > > (Note this isn't the final correct patch we should apply.  There
> > >  is no reason why this revert back to the older ->poll() logic
> > >  here should have any effect on the TX hang triggering...)
> >
> > s/no reason/no obvious reason/ ? ;-)

The tx code has an early exit that tries to limit the amount of tx
packets handled in a single poll loop and requires napi or interrupt
rescheduling based on the return value from e1000_clean_tx_irq.

see this code in e1000_clean_tx_irq

4005 #ifdef CONFIG_E1000_NAPI
4006 #define E1000_TX_WEIGHT 64
4007                 /* weight of a sort for tx, to avoid endless transmit cleanup */
4008                 if (count++ == E1000_TX_WEIGHT) break;
4009 #endif

I think that is probably related.  For a test you could apply the
original patch, and remove this break just by commenting out line 4008.

This would guarantee all tx work is cleaned at every e1000_clean

Jesse
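[Editorial illustration, not part of the original posting.] The early exit Jesse quotes can be modelled in user space to see why a single cleanup pass may leave work behind. This is a minimal sketch, assuming a ring reduced to a plain counter: `struct fake_tx_ring` and `fake_clean_tx()` are hypothetical names, and only the `count++ == TX_WEIGHT` test and the value 64 are taken from the snippet above.

```c
#include <stdbool.h>

#define TX_WEIGHT 64	/* same value as E1000_TX_WEIGHT in the quoted code */

/* Hypothetical stand-in for the TX ring: only a count of descriptors that
 * the hardware has finished with and that still need to be reclaimed. */
struct fake_tx_ring {
	unsigned int completed;
};

/* Reclaim completed descriptors, but stop once the weight is reached.
 * Returns true if anything was reclaimed, so the caller can tell whether
 * it should keep polling. */
static bool fake_clean_tx(struct fake_tx_ring *ring)
{
	unsigned int count = 0;
	bool cleaned = false;

	while (ring->completed > 0) {
		ring->completed--;	/* "free" one descriptor */
		cleaned = true;
		if (count++ == TX_WEIGHT)
			break;		/* early exit: work may remain */
	}
	return cleaned;
}
```

Because the test uses a post-increment, the loop stops on its 65th iteration: with 100 completed descriptors the first call reclaims 65 and leaves 35 for a later pass. If the caller ignores the return value and leaves polling mode at that point, nothing in this model comes back for the remainder.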
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Brandeburg, Jesse [EMAIL PROTECTED]
Date: Tue, 15 Jan 2008 13:53:43 -0800

> The tx code has an early exit that tries to limit the amount of tx
> packets handled in a single poll loop and requires napi or interrupt
> rescheduling based on the return value from e1000_clean_tx_irq.

That explains everything, thanks Jesse.

Ok, here is the patch I'll propose to fix this.  The goal is to make
it as simple as possible without regressing the thing we were trying
to fix.

Something more sophisticated can be done later.

Three of the 5 Intel drivers had the TX breakout logic.  e1000,
e1000e, and ixgbe.  e100 and ixgb did not, so they don't have any
problems we need to fix here.

What the fix does is behave as if the budget was fully consumed if
*_clean_tx_irq() returns true.

The only valid way to return from ->poll() without completing the NAPI
poll is by returning work_done == budget.  That signals to the caller
that the NAPI instance has not been descheduled and therefore the
caller fully owns the NAPI context.

This does mean that for these drivers any time TX work is done, we'll
loop at least one extra time in the ->poll() loop of net_rx_work() but
that is historically what these drivers have caused to happen for
years.

For 2.6.25 or similar I would suggest investigating courses of action
to bring closure and consistency to this:

1) Determine whether the loop breakout is actually necessary.
   Jesse explained to me that they had seen a case where a
   thread on one cpu feeding the TX ring could keep a thread
   on another cpu constantly running the *_clean_tx_irq() code
   in a loop.  I find this hard to believe since even the slowest
   CPU should be able to free up TX entries faster than they
   can be transmitted on gigabit links :-)

2) If the investigation in #1 deems the breakout logic is
   necessary, then consistently amongst all the 5 drivers a policy
   should be implemented which is integrated with the NAPI
   budgeting logic.

   For example, the simplest thing to do is to pass the budget
   and the work_done thing down into *_clean_tx_irq() and break
   out if it is exceeded.

   As a further refinement we can say that TX work is about
   1/4 the expense of RX work and adjust the budget checking
   logic to match that.

[NET]: Fix TX timeout regression in Intel drivers.

This fixes a regression added by changeset
53e52c729cc169db82a6105fac7a166e10c2ec36 ("[NET]: Make ->poll()
breakout consistent in Intel ethernet drivers.")

As pointed out by Jesse Brandeburg, for three of the drivers edited
above there is breakout logic in the *_clean_tx_irq() code to prevent
running TX reclaim forever.  If this occurs, we have to elide NAPI
poll completion or else those TX events will never be serviced.

Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 13d57b0..0c9a6f7 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -3929,14 +3929,17 @@ e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring[0] is currently being cleaned anyway. */
 	if (spin_trylock(&adapter->tx_queue_lock)) {
-		e1000_clean_tx_irq(adapter,
-		                   &adapter->tx_ring[0]);
+		tx_cleaned = e1000_clean_tx_irq(adapter,
+						&adapter->tx_ring[0]);
 		spin_unlock(&adapter->tx_queue_lock);
 	}
 
 	adapter->clean_rx(adapter, &adapter->rx_ring[0],
 	                  &work_done, budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		if (likely(adapter->itr_setting & 3))
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 4a6fc74..2ab3bfb 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1384,7 +1384,7 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -1394,12 +1394,15 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring is currently being cleaned anyway. */
 	if
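[Editorial illustration, not part of the original posting.] Suggestion 2) in the message above, passing the budget down into the TX cleanup and charging TX at about a quarter of the RX cost, might look roughly like this in a user-space model. Everything here is an assumption made for illustration: `struct fake_rings`, `fake_clean_tx_budgeted()`, `fake_poll_budgeted()` and `TX_PER_BUDGET_UNIT` are invented names, and this is not code from any driver.

```c
#define TX_PER_BUDGET_UNIT 4	/* assumption: 4 TX reclaims cost one RX packet */

/* Hypothetical ring state reduced to two counters. */
struct fake_rings {
	unsigned int tx_pending;	/* TX descriptors waiting for reclaim */
	unsigned int rx_pending;	/* received packets waiting */
};

/* TX cleanup that shares the poll budget: one unit of work_done is charged
 * for every TX_PER_BUDGET_UNIT descriptors reclaimed, and the loop stops
 * as soon as the budget is exhausted. */
static void fake_clean_tx_budgeted(struct fake_rings *r, int *work_done,
				   int budget)
{
	unsigned int reclaimed = 0;

	while (r->tx_pending > 0 && *work_done < budget) {
		r->tx_pending--;
		if (++reclaimed % TX_PER_BUDGET_UNIT == 0)
			(*work_done)++;
	}
}

/* Poll routine: TX first, then RX with whatever budget is left. The caller
 * would leave polling mode only when the return value is below budget. */
static int fake_poll_budgeted(struct fake_rings *r, int budget)
{
	int work_done = 0;

	fake_clean_tx_budgeted(r, &work_done, budget);

	while (r->rx_pending > 0 && work_done < budget) {
		r->rx_pending--;
		work_done++;
	}
	return work_done;
}
```

With a budget of 64, a backlog of 1000 TX descriptors uses the whole budget after 256 reclaims, so the function returns 64 and polling continues. A small backlog of 10 TX descriptors is charged only 2 units and leaves the rest of the budget for RX.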
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Frans Pop [EMAIL PROTECTED]
Date: Tue, 15 Jan 2008 06:25:10 +0100

> kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang

Does this make the problem go away?

(Note this isn't the final correct patch we should apply.  There
 is no reason why this revert back to the older ->poll() logic
 here should have any effect on the TX hang triggering...)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 13d57b0..cada32c 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_work = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -3929,8 +3929,8 @@ e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring[0] is currently being cleaned anyway. */
 	if (spin_trylock(&adapter->tx_queue_lock)) {
-		e1000_clean_tx_irq(adapter,
-		                   &adapter->tx_ring[0]);
+		tx_work = e1000_clean_tx_irq(adapter,
+					     &adapter->tx_ring[0]);
 		spin_unlock(&adapter->tx_queue_lock);
 	}
 
@@ -3938,7 +3938,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 	                  &work_done, budget);
 
 	/* If budget not fully consumed, exit the polling mode */
-	if (work_done < budget) {
+	if (!tx_work && (work_done < budget)) {
 		if (likely(adapter->itr_setting & 3))
 			e1000_set_itr(adapter);
 		netif_rx_complete(poll_dev, napi);
[REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
After compiling v2.6.24-rc7-163-g1a1b285 (x86_64) yesterday I suddenly see
this error repeatedly:

kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
kernel:   Tx Queue             0
kernel:   TDH                  a
kernel:   TDT                  a
kernel:   next_to_use          a
kernel:   next_to_clean        ff
kernel: buffer_info[next_to_clean]
kernel:   time_stamp           10002738a
kernel:   next_to_watch        ff
kernel:   jiffies              1000275b4
kernel:   next_to_watch.status 1

My previous kernel was v2.6.24-rc7 and with that this error did not occur.
I have also never seen it with earlier kernels.

The values for TX Queue and next_to_watch.status are constant, the others
vary.

My NIC is:
01:00.0 Ethernet controller [0200]: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03)
01:00.0 0200: 8086:108c (rev 03)
	Subsystem: 8086:3096
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 1273
	Region 0: Memory at 9020 (32-bit, non-prefetchable) [size=128K]
	Region 1: Memory at 9010 (32-bit, non-prefetchable) [size=1M]
	Region 2: I/O ports at 1000 [size=32]
	Capabilities: [c8] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
		Address: fee0300c  Data: 41a9
	Capabilities: [e0] Express Endpoint IRQ 0
		Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
		Device: Latency L0s 512ns, L1 64us
		Device: AtnBtn- AtnInd- PwrInd-
		Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
		Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
		Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
		Link: Supported Speed 2.5Gb/s, Width x1, ASPM unknown, Port 0
		Link: Latency L0s 128ns, L1 64us
		Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch-
		Link: Speed 2.5Gb/s, Width x1

The system is an Intel D945GCZ main board with Intel(R) Pentium(R) D CPU
3.20GHz (dual core) processor.

Cheers,
FJP
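[Editorial illustration, not part of the original posting.] The time_stamp and jiffies values in the dump above are hexadecimal tick counts, so their difference tells how long the descriptor had been waiting when the message was printed. This is a small sketch of that arithmetic; `ticks_elapsed()` and `looks_hung()` are hypothetical helpers, and the one-second threshold is an assumption about the driver's check, not something stated in this thread. The tick rate (HZ) of the reporter's kernel is not given either, so it is a parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Ticks between the descriptor's time stamp and "now". Unsigned
 * subtraction stays correct even if the tick counter has wrapped. */
static uint64_t ticks_elapsed(uint64_t time_stamp, uint64_t jiffies)
{
	return jiffies - time_stamp;
}

/* True if more than one second's worth of ticks has passed. The signed
 * comparison follows the style of the kernel's time_after() macro. The
 * one-second threshold and the hz value are assumptions. */
static bool looks_hung(uint64_t time_stamp, uint64_t jiffies, unsigned int hz)
{
	return (int64_t)(time_stamp + hz - jiffies) < 0;
}
```

For the dump above, `ticks_elapsed(0x10002738a, 0x1000275b4)` gives 0x22a, that is 554 ticks. That is more than one second if the kernel was built with HZ=250 and less than one second with HZ=1000.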
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
Wow. That's fast! :-)

On Tuesday 15 January 2008, David Miller wrote:
> From: Frans Pop [EMAIL PROTECTED]
> > kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>
> Does this make the problem go away?

I'm compiling a kernel with the patch now. Will let you know the result.
May take a while as I don't know how to trigger the bug, so I'll just have
to let it run for some time.
Re: e1000 Detected Tx Unit Hang
Jesse, today the server froze and I was not able to see anything in the
logs. Nothing at all about any error, it just plain froze. Just in case it
matters, this is a different unit altogether, still the same model as the
units having the Tx Unit Hang, but with different memory, motherboard and
CPU. The only thing that is the same is the hard drive, a regular IDE...

The one thing I noticed that is very weird, to me at least, is that after
powering off the unit from the crash and rebooting it I saw some lines
like this in the logs:

Sep 16 11:08:03 www kernel: checking if image is initramfs... it is
Sep 16 07:05:19 www sysctl: kernel.msgmnb = 65536

The odd part is the difference in the time stamps between one entry and
the very next one in the log. Any ideas what can cause this?

Also, is there any way to get a dump, or some way to prevent the system
from locking up without any log entries?

Regards,

Paul

----- Original Message -----
From: Jesse Brandeburg [EMAIL PROTECTED]
To: Paul Aviles [EMAIL PROTECTED]
Cc: netdev@vger.kernel.org
Sent: Tuesday, September 05, 2006 12:09 PM
Subject: Re: e1000 Detected Tx Unit Hang

On 9/3/06, Paul Aviles [EMAIL PROTECTED] wrote:
> Hey Jesse, thanks for your reply. Here is the stuff on /procs. The weird

no problem,

> part is that I have several other identical systems and only one is
> affected. Today I moved the hard drive to another similar system and I
> am not seeing the problem so I am wondering if is something maybe wrong
> with the card eeprom? Is there a way to check that?

I doubt it is an eeprom problem. you can dump the eeproms with ethtool
-e eth0 from both machines and compare them.

Odd that only one system is having the problem. Could it be that the
hardware on that box is having issues? Are you sure the machines are
running the same bios version with the same settings? Any
overclocking?

> cat /proc/interrupts
>            CPU0       CPU1
>  16:      70540          0   IO-APIC-level  uhci_hcd:usb4, eth0

this could contribute to your problem, were you able to test without NAPI?

Jesse
Re: e1000 Detected Tx Unit Hang
Jesse, testing without NAPI, will see how it behaves.

Paul Aviles

----- Original Message -----
From: Jesse Brandeburg [EMAIL PROTECTED]
To: Paul Aviles [EMAIL PROTECTED]
Cc: netdev@vger.kernel.org
Sent: Tuesday, September 05, 2006 12:09 PM
Subject: Re: e1000 Detected Tx Unit Hang

On 9/3/06, Paul Aviles [EMAIL PROTECTED] wrote:
> Hey Jesse, thanks for your reply. Here is the stuff on /procs. The weird

no problem,

> part is that I have several other identical systems and only one is
> affected. Today I moved the hard drive to another similar system and I
> am not seeing the problem so I am wondering if is something maybe wrong
> with the card eeprom? Is there a way to check that?

I doubt it is an eeprom problem. you can dump the eeproms with ethtool
-e eth0 from both machines and compare them.

Odd that only one system is having the problem. Could it be that the
hardware on that box is having issues? Are you sure the machines are
running the same bios version with the same settings? Any
overclocking?

> cat /proc/interrupts
>            CPU0       CPU1
>  16:      70540          0   IO-APIC-level  uhci_hcd:usb4, eth0

this could contribute to your problem, were you able to test without NAPI?

Jesse
Re: e1000 Detected Tx Unit Hang
On 9/3/06, Paul Aviles [EMAIL PROTECTED] wrote:
> Hey Jesse, thanks for your reply. Here is the stuff on /procs. The weird

no problem,

> part is that I have several other identical systems and only one is
> affected. Today I moved the hard drive to another similar system and I
> am not seeing the problem so I am wondering if is something maybe wrong
> with the card eeprom? Is there a way to check that?

I doubt it is an eeprom problem. you can dump the eeproms with ethtool
-e eth0 from both machines and compare them.

Odd that only one system is having the problem. Could it be that the
hardware on that box is having issues? Are you sure the machines are
running the same bios version with the same settings? Any
overclocking?

> cat /proc/interrupts
>            CPU0       CPU1
>  16:      70540          0   IO-APIC-level  uhci_hcd:usb4, eth0

this could contribute to your problem, were you able to test without NAPI?

Jesse
Re: e1000 Detected Tx Unit Hang
I haven't done the NAPI test yet. These are identical systems altogether;
maybe the CPU is a different stepping at the most, but that is all. The

 16:      70540          0   IO-APIC-level  uhci_hcd:usb4, eth0

line is the same on every GS12 I have. No overclocking and same BIOS. Tyan
released ver 1.8 about a month ago and I did the upgrade, with the same
effect. Then I thought about upgrading to 2.6.17.11 just to see if the
driver would have any issues, and nothing, same deal. The only way I was
able to control it was using a dummy 10/100 non-management switch. Then we
had no issues. I will try without NAPI tomorrow 9-6-06 and will report
back.

My understanding of NAPI was that it will drop packets by design on
overload. Why would that cause a system lock?

Are there any other kernel options you would like me to enable to track
this better? If you need remote access to the system I can accommodate
that too, just let me know what time zone you are in to schedule it. Let
me know.

Regards,

Paul Aviles

----- Original Message -----
From: Jesse Brandeburg [EMAIL PROTECTED]
To: Paul Aviles [EMAIL PROTECTED]
Cc: netdev@vger.kernel.org
Sent: Tuesday, September 05, 2006 12:09 PM
Subject: Re: e1000 Detected Tx Unit Hang

On 9/3/06, Paul Aviles [EMAIL PROTECTED] wrote:
> Hey Jesse, thanks for your reply. Here is the stuff on /procs. The weird

no problem,

> part is that I have several other identical systems and only one is
> affected. Today I moved the hard drive to another similar system and I
> am not seeing the problem so I am wondering if is something maybe wrong
> with the card eeprom? Is there a way to check that?

I doubt it is an eeprom problem. you can dump the eeproms with ethtool
-e eth0 from both machines and compare them.

Odd that only one system is having the problem. Could it be that the
hardware on that box is having issues? Are you sure the machines are
running the same bios version with the same settings? Any
overclocking?

> cat /proc/interrupts
>            CPU0       CPU1
>  16:      70540          0   IO-APIC-level  uhci_hcd:usb4, eth0

this could contribute to your problem, were you able to test without NAPI?

Jesse
Re: e1000 Detected Tx Unit Hang
On 9/2/06, Paul Aviles [EMAIL PROTECTED] wrote:
> I am getting "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang"
> using stock 2.6.17.11, 2.6.17.5 or 2.6.17.4 kernels on centos 4.3.
>
> The server is a Tyan GS12 ( 82541GI/PI and 82547GI) and is connected to a
> Netgear GS724T Gig switch. I can easily reproduce the problem by trying
> to do a large ftp transfer to the server. It does not happen if the
> server is connected to a dummy 100 Mb switch, only when is connected to
> the Gig switch.
>
> I have also tried the options line below disabling tso, tx and rx in the
> modprobe.conf without any luck.

Hi Paul, sorry to hear about your problem.

You're getting hangs on the 82547 right? can you send the output of
cat /proc/interrupts. I'm curious if you are sharing interrupts
while running NAPI.

Also, please try the driver without CONFIG_E1000_NAPI enabled in your
kernel .config, and let us know the results. Someone has posted (what
they think is) a theoretical problem with irq_sem on the 82547 at
e1000.sf.net and I haven't had a chance to figure it out yet.

Jesse
Re: e1000 Detected Tx Unit Hang
Hey Jesse, thanks for your reply. Here is the stuff on /procs. The weird part is that I have several other identical systems and only one is affected. Today I moved the hard drive to another similar system and I am not seeing the problem so I am wondering if is something maybe wrong with the card eeprom? Is there a way to check that? Regards, Paul cat /proc/interrupts CPU0 CPU1 0:7716253 0IO-APIC-edge timer 3: 11538 0IO-APIC-edge serial 8: 1 0IO-APIC-edge rtc 9: 0 0 IO-APIC-level acpi 14: 93406 0IO-APIC-edge ide0 16: 70540 0 IO-APIC-level uhci_hcd:usb4, eth0 17: 2 0 IO-APIC-level ehci_hcd:usb1 18: 0 0 IO-APIC-level uhci_hcd:usb2, uhci_hcd:usb5 19: 90 0 IO-APIC-level uhci_hcd:usb3 NMI: 0 0 LOC:77158397715838 ERR: 0 MIS: 0 - Original Message - From: Jesse Brandeburg [EMAIL PROTECTED] To: Paul Aviles [EMAIL PROTECTED] Cc: netdev@vger.kernel.org Sent: Sunday, September 03, 2006 1:45 PM Subject: Re: e1000 Detected Tx Unit Hang On 9/2/06, Paul Aviles [EMAIL PROTECTED] wrote: I am getting e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang using stock 2.6.17.11, 2.6.17.5 or 2.6.17.4 kernels on centos 4.3. The server is a Tyan GS12 ( 82541GI/PI and 82547GI) and is connected to a Netgear GS724T Gig switch. I can easily reproduce the problem by trying to do a large ftp transfer to the server. It does not happen if the server is connected to a dummy 100 Mb switch, only when is connected to the Gig switch. I have also tried the options line below disabling tso, tx and rx in the modprobe.conf without any luck. Hi Paul, sorry to hear about your problem. You're getting hangs on the 82547 right? can you send the output of cat /proc/interrupts. I'm curious if you are sharing interrupts while running NAPI. Also, please try the driver without CONFIG_E1000_NAPI enabled in your kernel .config, and let us know the results. Someone has posted (what they think is) a theoretical problem with irq_sem on the 82547 at e1000.sf.net and I haven't had a chance to figure it out yet. 
Jesse
--
VGER BF report: U 0.495355
-
To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
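Editorial note: the interrupt-sharing question raised in this message can be checked mechanically. The following is a minimal Python sketch, not part of the original thread. It assumes the 2.6-era /proc/interrupts layout shown above (one count column per CPU, one controller-type column, then a comma-separated handler list); the function name shared_irqs and the SAMPLE text are hypothetical, with rows copied from the posted output.

```python
import re

# Rows copied from the /proc/interrupts output posted in this thread
# (columns re-spaced; counts and handler names unchanged).
SAMPLE = """\
           CPU0       CPU1
  0:    7716253          0    IO-APIC-edge  timer
 16:      70540          0   IO-APIC-level  uhci_hcd:usb4, eth0
 17:          2          0   IO-APIC-level  ehci_hcd:usb1
 18:          0          0   IO-APIC-level  uhci_hcd:usb2, uhci_hcd:usb5
NMI:          0          0
"""

def shared_irqs(text, prefix="eth"):
    """Return {irq: [handlers]} for numeric IRQ rows on which a handler
    whose name starts with `prefix` shares the line with another handler.

    Assumption: old-style layout with exactly one controller-type column
    between the per-CPU counts and the handler list.
    """
    result = {}
    lines = text.splitlines()
    if not lines:
        return result
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        m = re.match(r"\s*(\d+):\s*(.*)", line)
        if not m:
            continue  # NMI/LOC/ERR/MIS rows have no numeric IRQ
        # ncpu counts, one controller column, then the handler list
        fields = m.group(2).split(None, ncpu + 1)
        if len(fields) < ncpu + 2:
            continue
        handlers = [h.strip() for h in fields[ncpu + 1].split(",")]
        if len(handlers) > 1 and any(h.startswith(prefix) for h in handlers):
            result[int(m.group(1))] = handlers
    return result

if __name__ == "__main__":
    print(shared_irqs(SAMPLE))
```

On the posted output this reports IRQ 16, where eth0 shares the line with uhci_hcd:usb4. Newer kernels print additional columns, so the parsing would need adjusting there.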
e1000 Detected Tx Unit Hang
I am getting e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang using stock 2.6.17.11, 2.6.17.5 or 2.6.17.4 kernels on CentOS 4.3. The server is a Tyan GS12 (82541GI/PI and 82547GI) and is connected to a Netgear GS724T Gig switch. I can easily reproduce the problem by trying to do a large ftp transfer to the server. It does not happen if the server is connected to a dummy 100 Mb switch, only when it is connected to the Gig switch. I have also tried the options line below disabling tso, tx and rx in the modprobe.conf without any luck.

options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0

In /var/log/kernel I get the following...

Sep 1 23:53:01 www kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Sep 1 23:53:01 www kernel: Tx Queue 0
Sep 1 23:53:01 www kernel: TDH 4c4
Sep 1 23:53:01 www kernel: TDT 4c9
Sep 1 23:53:01 www kernel: next_to_use 4c9
Sep 1 23:53:01 www kernel: next_to_clean 4c4
Sep 1 23:53:01 www kernel: buffer_info[next_to_clean]
Sep 1 23:53:01 www kernel: time_stamp 9c60
Sep 1 23:53:01 www kernel: next_to_watch 4c4
Sep 1 23:53:01 www kernel: jiffies 9d96
Sep 1 23:53:01 www kernel: next_to_watch.status 0
... repeats the same as above a few times ...
Sep 1 23:53:10 www kernel: NETDEV WATCHDOG: eth0: transmit timed out
Sep 1 23:53:13 www kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex

Then the server locks up, with no response from the keyboard at all, and must be forced down with a power kill.

The suggested tips on how to deal with this issue are not working, so if I can help troubleshoot this let me know.

Here is my system info:

driver: e1000
version: 7.0.33-k2-NAPI
firmware-version: N/A
bus-info: :02:01.0

lspci -vv output below..
00:00.0 Host bridge: Intel Corporation 82875P/E7210 Memory Controller Hub (rev 02)
    Subsystem: Intel Corporation 82875P/E7210 Memory Controller Hub
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
    Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast TAbort- TAbort- MAbort+ SERR- PERR-
    Latency: 0
    Region 0: Memory at 9000 (32-bit, prefetchable) [size=128M]
    Capabilities: [e4] Vendor Specific Information
    Capabilities: [a0] AGP version 3.0
        Status: RQ=32 Iso- ArqSz=2 Cal=0 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3- Rate=x1,x2,x4
        Command: RQ=1 ArqSz=0 Cal=0 SBA- AGP- GART64- 64bit- FW- Rate=none

00:01.0 PCI bridge: Intel Corporation 82875P Processor to AGP Controller (rev 02) (prog-if 00 [Normal decode])
    Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
    Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR-
    Latency: 64
    Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
    Secondary status: 66Mhz+ FastB2B+ ParErr- DEVSEL=medium TAbort- TAbort- MAbort+ SERR- PERR-
    BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- Reset- FastB2B-

00:03.0 PCI bridge: Intel Corporation 82875P/E7210 Processor to PCI to CSA Bridge (rev 02) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
    Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR-
    Latency: 32
    Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
    I/O behind bridge: 2000-2fff
    Memory behind bridge: fc10-fc1f
    Secondary status: 66Mhz+ FastB2B+ ParErr- DEVSEL=medium TAbort- TAbort- MAbort- SERR- PERR-
    BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- Reset- FastB2B-

00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02) (prog-if 00 [UHCI])
    Subsystem: Intel Corporation: Unknown device 24c0
    Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
    Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium TAbort- TAbort- MAbort- SERR- PERR-
    Latency: 0
    Interrupt: pin A routed to IRQ 18
    Region 4: I/O ports at 1400 [size=32]

00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02) (prog-if 00 [UHCI])
    Subsystem: Intel Corporation: Unknown device 24c0
    Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
    Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium TAbort- TAbort- MAbort- SERR- PERR-
    Latency: 0
    Interrupt: pin B routed to IRQ 19
    Region 4: I/O ports at 1420 [size=32]

00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev
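
Editorial note: the "Detected Tx Unit Hang" dump in this report can be read numerically. The sketch below is not from the thread. It assumes the dump values are hexadecimal (the archive appears to have stripped the angle brackets the driver prints around them) and takes the ring size from the TxDescriptors=4096 option quoted in the report. The helper names parse_dump and summarize are hypothetical.

```python
import re

# Field values copied from the kernel log in this report
# (syslog prefix removed).
DUMP = """\
Tx Queue 0
TDH 4c4
TDT 4c9
next_to_use 4c9
next_to_clean 4c4
buffer_info[next_to_clean]
time_stamp 9c60
next_to_watch 4c4
jiffies 9d96
next_to_watch.status 0
"""

def parse_dump(text):
    """Map each 'name value' line to an int, reading the value as hex.
    Lines that do not end in a single hex token are skipped."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"\s*([A-Za-z_.]+)\s+([0-9a-fA-F]+)\s*$", line)
        if m:
            fields[m.group(1)] = int(m.group(2), 16)
    return fields

def summarize(fields, ring_size):
    """Return (descriptors between hardware head and tail,
    age in jiffies of the buffer at next_to_clean)."""
    pending = (fields["TDT"] - fields["TDH"]) % ring_size
    age_jiffies = fields["jiffies"] - fields["time_stamp"]
    return pending, age_jiffies

if __name__ == "__main__":
    f = parse_dump(DUMP)
    # ring_size=4096 is an assumption taken from TxDescriptors=4096 above
    pending, age = summarize(f, ring_size=4096)
    print(f"pending descriptors: {pending}, age: {age} jiffies")
```

For this dump the hardware head (TDH) trails the tail (TDT) by 5 descriptors and the oldest uncleaned buffer is 310 jiffies old. Converting that to seconds requires the kernel's HZ value, which the report does not state.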
Re: e1000 Detected Tx Unit Hang
Paul Aviles wrote:

I am getting e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang using stock 2.6.17.11, 2.6.17.5 or 2.6.17.4 kernels on CentOS 4.3. The server is a Tyan GS10 and is connected to a Netgear GS724T Gig switch. I can easily reproduce the problem by trying to do a large ftp transfer to the server. It does not happen if the server is connected to a dummy 100 Mb switch, only when it is connected to the Gig switch. I have also tried the options line below disabling tso, tx and rx in the modprobe.conf without any luck.

options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0

In /var/log/kernel I get the following...

Sep 1 23:53:01 www kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Sep 1 23:53:01 www kernel: Tx Queue 0
Sep 1 23:53:01 www kernel: TDH 4c4
Sep 1 23:53:01 www kernel: TDT 4c9
Sep 1 23:53:01 www kernel: next_to_use 4c9
Sep 1 23:53:01 www kernel: next_to_clean 4c4
Sep 1 23:53:01 www kernel: buffer_info[next_to_clean]
Sep 1 23:53:01 www kernel: time_stamp 9c60
Sep 1 23:53:01 www kernel: next_to_watch 4c4
Sep 1 23:53:01 www kernel: jiffies 9d96
Sep 1 23:53:01 www kernel: next_to_watch.status 0
... repeats the same as above a few times ...
Sep 1 23:53:10 www kernel: NETDEV WATCHDOG: eth0: transmit timed out
Sep 1 23:53:13 www kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex

Then the server locks up, with no response from the keyboard at all, and must be forced down with a power kill.

Here is my driver info:

driver: e1000
version: 7.0.33-k2-NAPI
firmware-version: N/A
bus-info: :02:01.0

What else could I check?

[adding netdev to cc, this is a NET issue]

This is a known issue and there are several discussions and bugs filed on this.
Please read this one where most is documented, and also the netdev

http://sourceforge.net/tracker/index.php?func=detail&aid=1463045&group_id=42302&atid=447449

More links and information are available on http://e1000.sf.net/

Your debugging information might be needed and helpful, so please take the trouble of digging through the previous bug reports and reporting anything that might be relevant there.

The full lockup is certainly not good, but it is not necessarily related to the tx hang (or the cause of it). It is likely that interrupt sharing might be a problem here; what kind of e1000 nic is this? lspci -vv?

Cheers,
Auke