Re: e1000: Detected Tx Unit Hang

2008-02-19 Thread Kok, Auke
Bernd Schubert wrote:
 On Saturday 16 February 2008, Kok, Auke wrote:
 Bernd Schubert wrote:
 Hello,

 I can't login to one of our servers and just got this in an ipmi sol
 session:

 [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
 [18169.209183]   Tx Queue 0
 [18169.209184]   TDH  e3
 [18169.209185]   TDT  e3
 [18169.209186]   next_to_use  e3
 [18169.209187]   next_to_clean  bd
 [18169.209188] buffer_info[next_to_clean]
 [18169.209189]   time_stamp   10043e4d2
 [18169.209190]   next_to_watch  be
 [18169.209191]   jiffies  10043e6f6
 [18169.209192]   next_to_watch.status 1
 [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
 [18169.256979]   Tx Queue 0
 [18169.256980]   TDH  de
 [18169.256982]   TDT  de
 [18169.256983]   next_to_use  de
 [18169.256984]   next_to_clean  bc
 [18169.256985] buffer_info[next_to_clean]
 [18169.256986]   time_stamp   10043e511
 [18169.256987]   next_to_watch  bd
 [18169.256988]   jiffies  10043e701
 [18169.256989]   next_to_watch.status 1

 This is with 2.6.22.18. Is there any chance to recover the system? For
 some reasons I would prefer not to reboot now.
 if that's all you have then it was a false alarm. there should be a 'netdev
 timeout - link reset' following those messages. can you send some more
 context on those messages?
 
 All I presently know is that there are 20 servers and login doesn't work any 
 more - sysrq+t does show me it hangs in fuse, which is accessing the 
 underlying nfs (we are using unionfs-fuse). While I checked the sysrq-t 
 output suddenly these e1000 messages appeared.
 Thinking a bit about it, it either could be 2.6.22.18 has an e1000 bug, which 
 2.6.22.X didn't have (X=16, I think, but I'm not sure) or someone  
 mis-configured the switch/network environment today. 
 Hmm, now that I think about the last part, there already had been other 
 networking problems today, which were supposed to be fixed several hours ago. 
 Seems they didn't fix it properly.
 
 in real tx hang cases, the hardware is reset within 2 seconds, and
 everything continues as normal.
 
 Thanks, this gives me hope I don't need to reboot the servers (a reboot would
 mean I would need to start 60 md-raid rebuilds...).

my first thought after reading this e-mail is that the tx-hang message is just a
symptom of your system not responding or being spinlocked all the time. These TX
hang issues normally do not interfere with normal system operation at all, and
unless you have continuous TX resets you would be able to log on perfectly fine.

I think you might have hit another kernel bug here... perhaps even unionfs/fuse
related, which certainly looks plausible from your problem description.

looking at the changelog for 2.6.22.16-2.6.22.18 I can't see anything relevant
(see
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.22.y.git;a=shortlog),
but there are definitely no e1000 driver changes in that range anyway.

I don't suppose you can do a git-bisect? that would certainly help. I don't think
we can rule out anything just yet here.

At least try to revert some of your systems to the previous kernel version and see
if the problem goes away...

Auke
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


e1000: Detected Tx Unit Hang

2008-02-15 Thread Bernd Schubert
Hello,

I can't login to one of our servers and just got this in an ipmi sol
session:

[18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[18169.209183]   Tx Queue 0
[18169.209184]   TDH  e3
[18169.209185]   TDT  e3
[18169.209186]   next_to_use  e3
[18169.209187]   next_to_clean  bd
[18169.209188] buffer_info[next_to_clean]
[18169.209189]   time_stamp   10043e4d2
[18169.209190]   next_to_watch  be
[18169.209191]   jiffies  10043e6f6
[18169.209192]   next_to_watch.status 1
[18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
[18169.256979]   Tx Queue 0
[18169.256980]   TDH  de
[18169.256982]   TDT  de
[18169.256983]   next_to_use  de
[18169.256984]   next_to_clean  bc
[18169.256985] buffer_info[next_to_clean]
[18169.256986]   time_stamp   10043e511
[18169.256987]   next_to_watch  bd
[18169.256988]   jiffies  10043e701
[18169.256989]   next_to_watch.status 1

This is with 2.6.22.18. Is there any chance to recover the system? For some
reasons I would prefer not to reboot now.

Thanks,
Bernd



Re: e1000: Detected Tx Unit Hang

2008-02-15 Thread Kok, Auke
Bernd Schubert wrote:
 Hello,
 
 I can't login to one of our servers and just got this in an ipmi sol
 session:
 
 [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
 [18169.209183]   Tx Queue 0
 [18169.209184]   TDH  e3
 [18169.209185]   TDT  e3
 [18169.209186]   next_to_use  e3
 [18169.209187]   next_to_clean  bd
 [18169.209188] buffer_info[next_to_clean]
 [18169.209189]   time_stamp   10043e4d2
 [18169.209190]   next_to_watch  be
 [18169.209191]   jiffies  10043e6f6
 [18169.209192]   next_to_watch.status 1
 [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
 [18169.256979]   Tx Queue 0
 [18169.256980]   TDH  de
 [18169.256982]   TDT  de
 [18169.256983]   next_to_use  de
 [18169.256984]   next_to_clean  bc
 [18169.256985] buffer_info[next_to_clean]
 [18169.256986]   time_stamp   10043e511
 [18169.256987]   next_to_watch  bd
 [18169.256988]   jiffies  10043e701
 [18169.256989]   next_to_watch.status 1
 
 This is with 2.6.22.18. Is there any chance to recover the system? For some
 reasons I would prefer not to reboot now.

if that's all you have then it was a false alarm. there should be a 'netdev
timeout - link reset' following those messages. can you send some more context
on those messages?

in real tx hang cases, the hardware is reset within 2 seconds, and everything
continues as normal.
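
For readers of the archive, the register dump quoted above can be read
mechanically. The following is a small user-space sketch, not driver code: the
helper names and the ring size of 256 descriptors are assumptions made for
illustration (the ring size is not stated in this thread). It computes how many
descriptors are still waiting to be cleaned and how many jiffies have passed
since the stuck descriptor was queued.

```c
#include <assert.h>
#include <stdint.h>

/* Descriptors handed to the ring by the driver but not yet reclaimed.
 * ring_size is a hypothetical parameter; indices wrap around at ring_size. */
static unsigned int tx_ring_pending(unsigned int next_to_use,
				    unsigned int next_to_clean,
				    unsigned int ring_size)
{
	assert(ring_size > 0);
	assert(next_to_use < ring_size && next_to_clean < ring_size);
	return (next_to_use + ring_size - next_to_clean) % ring_size;
}

/* Jiffies elapsed between queuing (time_stamp) and the dump (jiffies).
 * Unsigned subtraction stays correct if the counter has wrapped. */
static uint64_t jiffies_elapsed(uint64_t now, uint64_t time_stamp)
{
	return now - time_stamp;
}

/* TDH == TDT: the hardware head has caught up with the tail, so the
 * NIC has no descriptors left to fetch. */
static int hw_ring_drained(unsigned int tdh, unsigned int tdt)
{
	return tdh == tdt;
}
```

With the eth0 values above (next_to_use e3, next_to_clean bd, time_stamp
10043e4d2, jiffies 10043e6f6) this gives 38 pending descriptors and 548
jiffies; how many seconds that is depends on the kernel's HZ setting, which is
not given here.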

Auke


Re: e1000: Detected Tx Unit Hang

2008-02-15 Thread Bernd Schubert
On Saturday 16 February 2008, Kok, Auke wrote:
 Bernd Schubert wrote:
  Hello,
 
  I can't login to one of our servers and just got this in an ipmi sol
  session:
 
  [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
  [18169.209183]   Tx Queue 0
  [18169.209184]   TDH  e3
  [18169.209185]   TDT  e3
  [18169.209186]   next_to_use  e3
  [18169.209187]   next_to_clean  bd
  [18169.209188] buffer_info[next_to_clean]
  [18169.209189]   time_stamp   10043e4d2
  [18169.209190]   next_to_watch  be
  [18169.209191]   jiffies  10043e6f6
  [18169.209192]   next_to_watch.status 1
  [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
  [18169.256979]   Tx Queue 0
  [18169.256980]   TDH  de
  [18169.256982]   TDT  de
  [18169.256983]   next_to_use  de
  [18169.256984]   next_to_clean  bc
  [18169.256985] buffer_info[next_to_clean]
  [18169.256986]   time_stamp   10043e511
  [18169.256987]   next_to_watch  bd
  [18169.256988]   jiffies  10043e701
  [18169.256989]   next_to_watch.status 1
 
  This is with 2.6.22.18. Is there any chance to recover the system? For
  some reasons I would prefer not to reboot now.

 if that's all you have then it was a false alarm. there should be a 'netdev
 timeout - link reset' following those messages. can you send some more
 context on those messages?

All I presently know is that there are 20 servers and login doesn't work any
more - sysrq+t shows me that it hangs in fuse, which is accessing the
underlying nfs (we are using unionfs-fuse). While I was checking the sysrq-t
output, these e1000 messages suddenly appeared.
Thinking a bit about it, either 2.6.22.18 has an e1000 bug which
2.6.22.X didn't have (X=16, I think, but I'm not sure), or someone
mis-configured the switch/network environment today.
Hmm, now that I think about the last part, there had already been other
networking problems today, which were supposed to have been fixed several
hours ago. Seems they didn't fix it properly.


 in real tx hang cases, the hardware is reset within 2 seconds, and
 everything continues as normal.

Thanks, this gives me hope I don't need to reboot the servers (a reboot would
mean I would need to start 60 md-raid rebuilds...).

Thanks,
Bernd


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-21 Thread David Miller
From: Robert Olsson [EMAIL PROTECTED]
Date: Mon, 21 Jan 2008 14:27:13 +0100

  Yes it works. e1000 tested for ~3 hours with very high load and
  interface up/down every 5th sec. Without the patch the irqs get
  disabled within a couple of seconds.
 
  A resolute way of handling the semaphores. :)

  Signed-off-by: Robert Olsson [EMAIL PROTECTED]


Thanks for testing Robert.

I sent off that fix to Linus an hour or so ago, hopefully
he will pick it up some time today.


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-21 Thread Robert Olsson

David Miller writes:

  Yes, this semaphore thing is highly problematic.  In the most crucial
  areas where network driver consistency matters the most for ease of
  understanding and debugging, the Intel drivers choose to be different
  :-(
  
  The way the napi_disable() logic breaks out from high packet load in
  net_rx_action() is it simply returns even leaving interrupts disabled
  when a pending napi_disable() is pending.
  
  This is what trips up the semaphore logic.
  
  Robert, give this patch a try.


 Yes it works. e1000 tested for ~3 hours with very high load and
 interface up/down every 5th sec. Without the patch the irqs get
 disabled within a couple of seconds.

 A resolute way of handling the semaphores. :)
   
 Signed-off-by: Robert Olsson [EMAIL PROTECTED]
 
 Cheers
--ro


  In the long term this semaphore should be completely eliminated,
  there is no justification for it.
  
  Signed-off-by: David S. Miller [EMAIL PROTECTED]
  


RE: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-20 Thread Brandeburg, Jesse
David Miller wrote:
 From: Robert Olsson [EMAIL PROTECTED]
 Date: Fri, 18 Jan 2008 14:00:57 +0100
 
  I don't understand the idea of a semaphore for enabling/disabling
  irqs either; the overall logic must be safer/better without it.
 
 They must have had code paths where they didn't know if IRQs were
 enabled or not already, so they tried to create something which
 approximates the:
 
   local_irq_save(flags);
   local_irq_restore(flags);
 
 constructs we have for CPU interrupts, so they could go:
 
   e1000_irq_disable();
   /* ... */
   e1000_irq_enable();
 
 and this would work even if the caller was running
 with e1000 interrupts disabled already.
 
 Or, something like that... it is indeed confusing.
 
 Anyways, yes it's totally bogus and should be removed.

I agree, bogus, and in fact I've already removed it from our development
version of ixgbe.  Right now I wanted to report I can't remove e1000 at
all on 2.6.24-rc8+git.

I continually get the
 kernel: unregister_netdevice: waiting for eth2 to become free. Usage count = 1

Where 2.6.24-rc5 e1000 is okay still.  Seems like maybe we are still
missing a netif_rx_complete or a napi_disable somewhere.

I don't think this problem has anything to do with the irq_sem right
now.  Something is still badly broken.  I am just using the interface
regularly (no heavy load) and I can't unload the module.

Jesse


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-20 Thread Andrey Rahmatullin
On Sun, Jan 20, 2008 at 01:20:11AM -0800, Brandeburg, Jesse wrote:
 I continually get the
  kernel: unregister_netdevice: waiting for eth2 to become free. Usage count = 1
http://bugzilla.kernel.org/show_bug.cgi?id=9778

-- 
WBR, wRAR (ALT Linux Team)




Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-20 Thread Badalian Vyacheslav

Hello. It works, thanks for resending it!
Sorry, I understood that patch 53e52c729cc169db82a6105fac7a166e10c2ec36
([NET]: Make ->poll() breakout consistent in Intel ethernet drivers.)
caused the regression and rolled it back; I did not see your patch.

Sorry again.

Thanks!

From: Badalian Vyacheslav [EMAIL PROTECTED]
Date: Wed, 16 Jan 2008 12:02:28 +0300

  

Also have regression after apply patch.



BTW, if you are using the e1000e driver then this initial
patch will not work.

My more recent patch posting for this problem, will.

I include it again below for you:

[NET]: Fix TX timeout regression in Intel drivers.

This fixes a regression added by changeset
53e52c729cc169db82a6105fac7a166e10c2ec36 ([NET]: Make ->poll()
breakout consistent in Intel ethernet drivers.)

As pointed out by Jesse Brandeburg, for three of the drivers edited
above there is breakout logic in the *_clean_tx_irq() code to prevent
running TX reclaim forever.  If this occurs, we have to elide NAPI
poll completion or else those TX events will never be serviced.

Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 13d57b0..0c9a6f7 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -3929,14 +3929,17 @@ e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring[0] is currently being cleaned anyway. */
 	if (spin_trylock(&adapter->tx_queue_lock)) {
-		e1000_clean_tx_irq(adapter,
-		                   &adapter->tx_ring[0]);
+		tx_cleaned = e1000_clean_tx_irq(adapter,
+		                                &adapter->tx_ring[0]);
 		spin_unlock(&adapter->tx_queue_lock);
 	}
 
 	adapter->clean_rx(adapter, &adapter->rx_ring[0],
 	                  &work_done, budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		if (likely(adapter->itr_setting & 3))
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 4a6fc74..2ab3bfb 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1384,7 +1384,7 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -1394,12 +1394,15 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring is currently being cleaned anyway. */
 	if (spin_trylock(&adapter->tx_queue_lock)) {
-		e1000_clean_tx_irq(adapter);
+		tx_cleaned = e1000_clean_tx_irq(adapter);
 		spin_unlock(&adapter->tx_queue_lock);
 	}
 
 	adapter->clean_rx(adapter, &work_done, budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		if (adapter->itr_setting & 3)
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index a564916..de3f45e 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1468,13 +1468,16 @@ static int ixgbe_clean(struct napi_struct *napi, int budget)
 	struct ixgbe_adapter *adapter = container_of(napi,
 					struct ixgbe_adapter, napi);
 	struct net_device *netdev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* In non-MSIX case, there is no multi-Tx/Rx queue */
-	ixgbe_clean_tx_irq(adapter, adapter->tx_ring);
+	tx_cleaned = ixgbe_clean_tx_irq(adapter, adapter->tx_ring);
 	ixgbe_clean_rx_irq(adapter, &adapter->rx_ring[0], &work_done,
 			   budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		netif_rx_complete(netdev, napi);
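
The decision the patch above adds to each poll routine can be modelled in a few
lines of user-space C. This is only a sketch under assumptions: struct
poll_result and model_poll() are hypothetical names, and tx_cleaned simply
stands for whatever the driver's *_clean_tx_irq() returned. If TX cleaning
reported work, the poll claims the full budget, so the "budget not fully
consumed" branch is skipped and NAPI polling is not completed.

```c
#include <assert.h>
#include <stdbool.h>

struct poll_result {
	int  work_done;	/* value the poll routine would report */
	bool complete;	/* true if polling mode would be exited */
};

/* Model of the patched poll logic: rx_done is the number of RX packets
 * processed, budget the NAPI budget passed to the poll routine. */
static struct poll_result model_poll(bool tx_cleaned, int rx_done, int budget)
{
	struct poll_result r;

	assert(budget > 0 && rx_done >= 0 && rx_done <= budget);

	/* TX work pending: pretend the whole budget was used. */
	r.work_done = tx_cleaned ? budget : rx_done;

	/* Only leave polling mode if the budget was not fully consumed. */
	r.complete = r.work_done < budget;
	return r;
}
```

For example, model_poll(false, 10, 64) leaves polling mode, while
model_poll(true, 10, 64) reports the full budget and stays in it so that the
remaining TX events get serviced on the next poll.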

  




Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-18 Thread David Miller
From: Robert Olsson [EMAIL PROTECTED]
Date: Wed, 16 Jan 2008 18:07:38 +0100

 
 eth0 e1000_irq_enable sem = 1 <- High netload
 eth0 e1000_irq_enable sem = 1
 eth0 e1000_irq_enable sem = 1
 eth0 e1000_irq_enable sem = 1
 eth0 e1000_irq_enable sem = 1
 eth0 e1000_irq_enable sem = 1
 eth0 e1000_irq_enable sem = 1 <- ifconfig eth0 down
 eth0 e1000_irq_disable sem = 2
 
 **e1000_open <- ifconfig eth0 up
 eth0 e1000_irq_disable sem = 3  Dead. irq's can't be enabled
 e1000_irq_enable miss
 eth0 e1000_irq_enable sem = 2
 e1000_irq_enable miss
 eth0 e1000_irq_enable sem = 1
 ADDRCONF(NETDEV_UP): eth0: link is not ready

Yes, this semaphore thing is highly problematic.  In the most crucial
areas where network driver consistency matters the most for ease of
understanding and debugging, the Intel drivers choose to be different
:-(

The way the napi_disable() logic breaks out from high packet load in
net_rx_action() is that it simply returns, leaving interrupts disabled,
when a napi_disable() is pending.

This is what trips up the semaphore logic.

Robert, give this patch a try.

In the long term this semaphore should be completely eliminated,
there is no justification for it.

Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 0c9a6f7..76c0fa6 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -632,6 +632,7 @@ e1000_down(struct e1000_adapter *adapter)
 
 #ifdef CONFIG_E1000_NAPI
 	napi_disable(&adapter->napi);
+	atomic_set(&adapter->irq_sem, 0);
 #endif
 	e1000_irq_disable(adapter);
 
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 2ab3bfb..9cc5a6b 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -2183,6 +2183,7 @@ void e1000e_down(struct e1000_adapter *adapter)
 	msleep(10);
 
 	napi_disable(&adapter->napi);
+	atomic_set(&adapter->irq_sem, 0);
 	e1000_irq_disable(adapter);
 
 	del_timer_sync(&adapter->watchdog_timer);
diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index d2fb88d..4f63839 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -296,6 +296,11 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t kill_watchdog)
 {
 	struct net_device *netdev = adapter->netdev;
 
+#ifdef CONFIG_IXGB_NAPI
+	napi_disable(&adapter->napi);
+	atomic_set(&adapter->irq_sem, 0);
+#endif
+
 	ixgb_irq_disable(adapter);
 	free_irq(adapter->pdev->irq, netdev);
 
@@ -304,9 +309,7 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t kill_watchdog)
 
 	if(kill_watchdog)
 		del_timer_sync(&adapter->watchdog_timer);
-#ifdef CONFIG_IXGB_NAPI
-	napi_disable(&adapter->napi);
-#endif
+
 	adapter->link_speed = 0;
 	adapter->link_duplex = 0;
 	netif_carrier_off(netdev);
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index de3f45e..a4265bc 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1409,9 +1409,11 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 	IXGBE_WRITE_FLUSH(&adapter->hw);
 	msleep(10);
 
+	napi_disable(&adapter->napi);
+	atomic_set(&adapter->irq_sem, 0);
+
 	ixgbe_irq_disable(adapter);
 
-	napi_disable(&adapter->napi);
 	del_timer_sync(&adapter->watchdog_timer);
 
 	netif_carrier_off(netdev);


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-18 Thread David Miller
From: Robert Olsson [EMAIL PROTECTED]
Date: Fri, 18 Jan 2008 14:00:57 +0100

  I don't understand the idea of a semaphore for enabling/disabling
  irqs either; the overall logic must be safer/better without it.

They must have had code paths where they didn't know if IRQs were
enabled or not already, so they tried to create something which
approximates the:

local_irq_save(flags);
local_irq_restore(flags);

constructs we have for CPU interrupts, so they could go:

e1000_irq_disable();
/* ... */
e1000_irq_enable();

and this would work even if the caller was running
with e1000 interrupts disabled already.

Or, something like that... it is indeed confusing.

Anyways, yes it's totally bogus and should be removed.
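
To make the counting behaviour discussed in this thread concrete, a minimal
user-space model follows. It is a sketch under assumptions, not the driver
source: struct nic_model, the model_* functions and the reset_sem flag are
hypothetical names. Only the idea is taken from the thread: disable bumps a
counter, enable turns interrupts back on only when the counter drops to zero,
and the proposed fix zeroes the counter in the *_down() path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct nic_model {
	int  irq_sem;		/* nesting depth of disable calls */
	bool irq_enabled;	/* stands in for the hardware interrupt mask */
};

static void model_irq_disable(struct nic_model *nic)
{
	assert(nic != NULL);
	nic->irq_sem++;
	nic->irq_enabled = false;
}

static void model_irq_enable(struct nic_model *nic)
{
	assert(nic != NULL);
	/* Interrupts come back only when every disable has been paired. */
	if (--nic->irq_sem == 0)
		nic->irq_enabled = true;
}

/* Interface down. reset_sem models the atomic_set(..., 0) that the
 * proposed fix places just before the disable call. */
static void model_down(struct nic_model *nic, bool reset_sem)
{
	assert(nic != NULL);
	if (reset_sem)
		nic->irq_sem = 0;
	model_irq_disable(nic);
}
```

If a poll breaks out with interrupts still disabled (counter at 1) and the
interface then goes down and up again, the single enable on the way up only
brings the counter back to 1 and interrupts stay off. With the reset in place
the same sequence ends at 0 with interrupts enabled.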


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-18 Thread Robert Olsson

David Miller writes:

   eth0 e1000_irq_enable sem = 1 <- ifconfig eth0 down
   eth0 e1000_irq_disable sem = 2
   
   **e1000_open <- ifconfig eth0 up
   eth0 e1000_irq_disable sem = 3  Dead. irq's can't be enabled
   e1000_irq_enable miss
   eth0 e1000_irq_enable sem = 2
   e1000_irq_enable miss
   eth0 e1000_irq_enable sem = 1
   ADDRCONF(NETDEV_UP): eth0: link is not ready
  
  Yes, this semaphore thing is highly problematic.  In the most crucial
  areas where network driver consistency matters the most for ease of
  understanding and debugging, the Intel drivers choose to be different

 I don't understand the idea of a semaphore for enabling/disabling
 irqs either; the overall logic must be safer/better without it.
 
  The way the napi_disable() logic breaks out from high packet load in
  net_rx_action() is it simply returns even leaving interrupts disabled
  when a pending napi_disable() is pending.
  
  This is what trips up the semaphore logic.
  
  Robert, give this patch a try.
  
  In the long term this semaphore should be completely eliminated,
  there is no justification for it.

 It's on the testing list...

 Cheers
--ro


  
  Signed-off-by: David S. Miller [EMAIL PROTECTED]
  


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-17 Thread David Miller
From: Frans Pop [EMAIL PROTECTED]
Date: Thu, 17 Jan 2008 08:51:55 +0100

 On Thursday 17 January 2008, David Miller wrote:
  From: Brandeburg, Jesse [EMAIL PROTECTED]
 
   We spent Wednesday trying to reproduce (without the patch) these issues
   without much luck, and have applied the patch cleanly and will continue
   testing it.  Given the simplicity of the changes, and the community
   testing, I'll give my ack and we will continue testing.
 
  You need a slow CPU, and you need to make sure you do actually
  trigger the TX limiting code there.
 
 Hmmm. Is a dual core Pentium D 3.20GHz considered slow these days?

No of course :-)  I guess it therefore depends upon the load
as well.


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-17 Thread Arnaldo Carvalho de Melo
Em Thu, Jan 17, 2008 at 12:00:02AM -0800, David Miller escreveu:
 From: Frans Pop [EMAIL PROTECTED]
 Date: Thu, 17 Jan 2008 08:51:55 +0100
 
  On Thursday 17 January 2008, David Miller wrote:
   From: Brandeburg, Jesse [EMAIL PROTECTED]
  
We spent Wednesday trying to reproduce (without the patch) these issues
without much luck, and have applied the patch cleanly and will continue
testing it.  Given the simplicity of the changes, and the community
testing, I'll give my ack and we will continue testing.
  
   You need a slow CPU, and you need to make sure you do actually
   trigger the TX limiting code there.
  
  Hmmm. Is a dual core Pentium D 3.20GHz considered slow these days?
 
 No of course :-)  I guess it therefore depends upon the load
 as well.

I saw it just once, yesterday:

[EMAIL PROTECTED] ~]# uname -r
2.6.24-rc5
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue 0
  TDH  58
  TDT  8f
  next_to_use  8f
  next_to_clean  55
buffer_info[next_to_clean]
  time_stamp   105e973a9
  next_to_watch  56
  jiffies  105e97992
  next_to_watch.status 1
[EMAIL PROTECTED] ~]#

on a lenovo T60W, core2duo machine (2GHz), when using it to stress test
another machine, I was using netperf TCP_STREAM ranging from 1 to 8
streams + a ping -f using various packet sizes.

I'll update this machine today to 2.6.24-rc8-git + net-2.6 and try again
to reproduce.

I also applied David's patch while trying some RT experiments on
another, 8 way machine used as a server, but on this machine I didn't
experience the Tx Unit Hang message with or without the patch.

- Arnaldo


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-17 Thread David Miller
From: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
Date: Thu, 17 Jan 2008 07:40:07 -0200

 I'll update this machine today to 2.6.24-rc8-git + net-2.6 and try again
 to reproduce.

Thanks for the datapoints and testing.


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-16 Thread Frans Pop
On Wednesday 16 January 2008, David Miller wrote:
 Ok, here is the patch I'll propose to fix this.  The goal is to make
 it as simple as possible without regressing the thing we were trying
 to fix.

Looks good to me. Tested with -rc8.

Cheers,
FJP


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-16 Thread Badalian Vyacheslav

Applied to 2.6.24-rc7-git2.
I still get the messages.
There is also a regression after applying the patch.
Before the patch the system could do above 800 Mbit/s of traffic; after that it
exits polling mode? (4 CPUs: 1 CPU at 100% si in process ksoftirqd/0, 3 CPUs idle)

After the patch the system goes to exit polling mode at above 600 Mbit/s.

Thanks.


From: Frans Pop [EMAIL PROTECTED]
Date: Tue, 15 Jan 2008 06:25:10 +0100

  

kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang



Does this make the problem go away?

(Note this isn't the final correct patch we should apply.  There
 is no reason why this revert back to the older ->poll() logic
 here should have any effect on the TX hang triggering...)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 13d57b0..cada32c 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_work = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -3929,8 +3929,8 @@ e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring[0] is currently being cleaned anyway. */
 	if (spin_trylock(&adapter->tx_queue_lock)) {
-		e1000_clean_tx_irq(adapter,
-				   &adapter->tx_ring[0]);
+		tx_work = e1000_clean_tx_irq(adapter,
+					     &adapter->tx_ring[0]);
 		spin_unlock(&adapter->tx_queue_lock);
 	}
 
@@ -3938,7 +3938,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 	                  &work_done, budget);
 
 	/* If budget not fully consumed, exit the polling mode */
-	if (work_done < budget) {
+	if (!tx_work && (work_done < budget)) {
 		if (likely(adapter->itr_setting & 3))
 			e1000_set_itr(adapter);
 		netif_rx_complete(poll_dev, napi);


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-16 Thread David Miller
From: Frans Pop [EMAIL PROTECTED]
Date: Wed, 16 Jan 2008 09:56:08 +0100

 On Wednesday 16 January 2008, David Miller wrote:
  Ok, here is the patch I'll propose to fix this.  The goal is to make
  it as simple as possible without regressing the thing we were trying
  to fix.
 
 Looks good to me. Tested with -rc8.

Thanks for testing.


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-16 Thread David Miller
From: Badalian Vyacheslav [EMAIL PROTECTED]
Date: Wed, 16 Jan 2008 12:02:28 +0300

 Also have regression after apply patch.

BTW, if you are using the e1000e driver then this initial
patch will not work.

My more recent patch posting for this problem, will.

I include it again below for you:

[NET]: Fix TX timeout regression in Intel drivers.

This fixes a regression added by changeset
53e52c729cc169db82a6105fac7a166e10c2ec36 ([NET]: Make ->poll()
breakout consistent in Intel ethernet drivers.)

As pointed out by Jesse Brandeburg, for three of the drivers edited
above there is breakout logic in the *_clean_tx_irq() code to prevent
running TX reclaim forever.  If this occurs, we have to elide NAPI
poll completion or else those TX events will never be serviced.

Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 13d57b0..0c9a6f7 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -3929,14 +3929,17 @@ e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring[0] is currently being cleaned anyway. */
 	if (spin_trylock(&adapter->tx_queue_lock)) {
-		e1000_clean_tx_irq(adapter,
-				   &adapter->tx_ring[0]);
+		tx_cleaned = e1000_clean_tx_irq(adapter,
+						&adapter->tx_ring[0]);
 		spin_unlock(&adapter->tx_queue_lock);
 	}
 
 	adapter->clean_rx(adapter, &adapter->rx_ring[0],
 	                  &work_done, budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		if (likely(adapter->itr_setting & 3))
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 4a6fc74..2ab3bfb 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1384,7 +1384,7 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -1394,12 +1394,15 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring is currently being cleaned anyway. */
 	if (spin_trylock(&adapter->tx_queue_lock)) {
-		e1000_clean_tx_irq(adapter);
+		tx_cleaned = e1000_clean_tx_irq(adapter);
 		spin_unlock(&adapter->tx_queue_lock);
 	}
 
 	adapter->clean_rx(adapter, &work_done, budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		if (adapter->itr_setting & 3)
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index a564916..de3f45e 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1468,13 +1468,16 @@ static int ixgbe_clean(struct napi_struct *napi, int budget)
 	struct ixgbe_adapter *adapter = container_of(napi,
 					struct ixgbe_adapter, napi);
 	struct net_device *netdev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* In non-MSIX case, there is no multi-Tx/Rx queue */
-	ixgbe_clean_tx_irq(adapter, adapter->tx_ring);
+	tx_cleaned = ixgbe_clean_tx_irq(adapter, adapter->tx_ring);
 	ixgbe_clean_rx_irq(adapter, &adapter->rx_ring[0], &work_done,
 	                   budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		netif_rx_complete(netdev, napi);


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-16 Thread David Miller
From: Badalian Vyacheslav [EMAIL PROTECTED]
Date: Wed, 16 Jan 2008 12:02:28 +0300

 applied to 2.6.24-rc7-git2
 Have messages
 Also have regression after apply patch.
 System may do above 800mbs traffic before patch. After its exit polling 
 mode? (4 CPU, 1 cpu get 100% si (process ksoftirqd/0), 3 CPU is IDLE)
 After patch system was go to exit polling mode at above 600mbs.

What do you mean by 'system was go to exit polling mode'?

Please be more clear about your situation, in particular
provide every detail about what happens so that we can
properly debug this.

Thanks.


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-16 Thread Robert Olsson

David Miller writes:
   On Wednesday 16 January 2008, David Miller wrote:
Ok, here is the patch I'll propose to fix this.  The goal is to make
it as simple as possible without regressing the thing we were trying
to fix.
   
   Looks good to me. Tested with -rc8.
  
  Thanks for testing.

 Yes, that code looks nice. I'm using the patch but I've noticed another
 phenomenon with the current e1000 driver. There is a race when taking a
 device down at high traffic loads. I've tracked and instrumented it, and it
 seems that occasionally irq_sem can get bumped up so that interrupts can't be
 enabled again.


eth0 e1000_irq_enable sem = 1    <- High netload
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1    <- ifconfig eth0 down
eth0 e1000_irq_disable sem = 2

**e1000_open                     <- ifconfig eth0 up
eth0 e1000_irq_disable sem = 3   <- Dead. irq's can't be enabled
e1000_irq_enable miss
eth0 e1000_irq_enable sem = 2
e1000_irq_enable miss
eth0 e1000_irq_enable sem = 1
ADDRCONF(NETDEV_UP): eth0: link is not ready


Cheers
--ro

static void
e1000_irq_disable(struct e1000_adapter *adapter)
{
	atomic_inc(&adapter->irq_sem);
	E1000_WRITE_REG(&adapter->hw, IMC, ~0);
	E1000_WRITE_FLUSH(&adapter->hw);
	synchronize_irq(adapter->pdev->irq);

	if (adapter->netdev->ifindex == 3)
		printk("%s e1000_irq_disable sem = %d\n", adapter->netdev->name,
		       atomic_read(&adapter->irq_sem));
}

static void
e1000_irq_enable(struct e1000_adapter *adapter)
{
	if (likely(atomic_dec_and_test(&adapter->irq_sem))) {
		E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK);
		E1000_WRITE_FLUSH(&adapter->hw);
	}
	else
		printk("e1000_irq_enable miss\n");

	if (adapter->netdev->ifindex == 3)
		printk("%s e1000_irq_enable sem = %d\n", adapter->netdev->name,
		       atomic_read(&adapter->irq_sem));
}
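
To make the counting scheme in the instrumented functions easier to follow,
here is a minimal single-threaded model of it. This is only a sketch: the
struct and function names (irq_model, irq_disable_model, irq_enable_model)
are made up for illustration and it does not try to reproduce the race
itself. It only shows that, with this scheme, one disable that has no
matching enable leaves the device masked.

```c
#include <stdbool.h>

/* Hypothetical model of the irq_sem counting scheme; not driver code. */
struct irq_model {
	int sem;	/* number of outstanding disables */
	bool unmasked;	/* true when the device may raise interrupts */
};

/* Every disable bumps the count and masks interrupts. */
static void irq_disable_model(struct irq_model *m)
{
	m->sem++;
	m->unmasked = false;
}

/* Interrupts are unmasked only when the count drops back to zero,
 * which is the condition atomic_dec_and_test() checks in the driver. */
static void irq_enable_model(struct irq_model *m)
{
	if (--m->sem == 0)
		m->unmasked = true;
}
```

With balanced calls the model ends up unmasked again. After two disables and
a single enable, sem stays at 1 and unmasked stays false, which corresponds
to the "e1000_irq_enable miss" lines in the trace.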


RE: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-16 Thread Brandeburg, Jesse
David Miller wrote:
 From: Brandeburg, Jesse [EMAIL PROTECTED]
 Date: Tue, 15 Jan 2008 13:53:43 -0800
 
 The tx code has an early exit that tries to limit the amount of tx
 packets handled in a single poll loop and requires napi or interrupt
 rescheduling based on the return value from e1000_clean_tx_irq.
 
 That explains everything, thanks Jesse.
 
 Ok, here is the patch I'll propose to fix this.  The goal is to make
 it as simple as possible without regressing the thing we were trying
 to fix.

We spent Wednesday trying to reproduce (without the patch) these issues
without much luck, and have applied the patch cleanly and will continue
testing it.  Given the simplicity of the changes, and the community
testing, I'll give my ack and we will continue testing.

I think we should fix Robert's (unrelated, but in this thread) reported
issue before 2.6.24 final if we can, and I'll look at that tonight and
tomorrow.

Thanks for your work on this Dave,
 Jesse

Acked-by: Jesse Brandeburg [EMAIL PROTECTED]


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-16 Thread David Miller
From: Brandeburg, Jesse [EMAIL PROTECTED]
Date: Wed, 16 Jan 2008 23:09:47 -0800

 We spent Wednesday trying to reproduce (without the patch) these issues
 without much luck, and have applied the patch cleanly and will continue
 testing it.  Given the simplicity of the changes, and the community
 testing, I'll give my ack and we will continue testing.

You need a slow CPU, and you need to make sure you do actually
trigger the TX limiting code there.

I bet your cpus are fast enough that it simply never triggers.
:-)

 Acked-by: Jesse Brandeburg [EMAIL PROTECTED]

Thanks for reviewing Jesse.


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-16 Thread Frans Pop
On Thursday 17 January 2008, David Miller wrote:
 From: Brandeburg, Jesse [EMAIL PROTECTED]

  We spent Wednesday trying to reproduce (without the patch) these issues
  without much luck, and have applied the patch cleanly and will continue
  testing it.  Given the simplicity of the changes, and the community
  testing, I'll give my ack and we will continue testing.

 You need a slow CPU, and you need to make sure you do actually
 trigger the TX limiting code there.

Hmmm. Is a dual core Pentium D 3.20GHz considered slow these days?


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-15 Thread Frans Pop
On Tuesday 15 January 2008, David Miller wrote:
 From: Frans Pop [EMAIL PROTECTED]
  kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang

 Does this make the problem go away?

Yes, it very much looks like that solves it.
I ran with the patch for 6 hours or so without any errors. I then switched 
back to an unpatched kernel and they reappeared immediately.

 (Note this isn't the final correct patch we should apply.  There
  is no reason why this revert back to the older ->poll() logic
  here should have any effect on the TX hang triggering...)

s/no reason/no obvious reason/ ? ;-)

Cheers,
FJP


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-15 Thread slavon

Quoting Frans Pop [EMAIL PROTECTED]:


On Tuesday 15 January 2008, David Miller wrote:

From: Frans Pop [EMAIL PROTECTED]
 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang

Does this make the problem go away?


Yes, it very much looks like that solves it.
I ran with the patch for 6 hours or so without any errors. I then switched
back to an unpatched kernel and they reappeared immediately.


(Note this isn't the final correct patch we should apply.  There
 is no reason why this revert back to the older ->poll() logic
 here should have any effect on the TX hang triggering...)


s/no reason/no obvious reason/ ? ;-)

Cheers,
FJP




Hello.

I also try your patch (apply to 2.6.24-rc7-git2)

I catch this message in dmesg
[ 1771.796954] e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
[ 1771.796957]   Tx Queue             <0>
[ 1771.796958]   TDH                  <54>
[ 1771.796959]   TDT                  <54>
[ 1771.796960]   next_to_use          <54>
[ 1771.796961]   next_to_clean        <a9>
[ 1771.796962] buffer_info[next_to_clean]
[ 1771.796963]   time_stamp           <14d72e>
[ 1771.796964]   next_to_watch        <a9>
[ 1771.796965]   jiffies              <14ddd3>
[ 1771.796966]   next_to_watch.status <1>

Thanks.





RE: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-15 Thread Brandeburg, Jesse
[EMAIL PROTECTED] wrote:
 Quoting Frans Pop [EMAIL PROTECTED]:
 (Note this isn't the final correct patch we should apply.  There  is
 no reason why this revert back to the older ->poll() logic  here
 should have any effect on the TX hang triggering...)
 
 s/no reason/no obvious reason/ ? ;-)

The tx code has an early exit that tries to limit the amount of tx
packets handled in a single poll loop and requires napi or interrupt
rescheduling based on the return value from e1000_clean_tx_irq.

see this code in e1000_clean_tx_irq

4005 #ifdef CONFIG_E1000_NAPI
4006 #define E1000_TX_WEIGHT 64
4007 	/* weight of a sort for tx, to avoid endless transmit cleanup */
4008 	if (count++ == E1000_TX_WEIGHT) break;
4009 #endif

I think that is probably related.  For a test you could apply the
original patch, and remove this break just by commenting out line
4008.  This would guarantee all tx work is cleaned at every e1000_clean
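
As a rough illustration of the breakout described above, here is a small
user-space sketch, not the driver code. The names tx_ring_model and
clean_tx_model are hypothetical; only the weight of 64 is taken from the
snippet. The function stops after a bounded number of descriptors and
returns whether it reclaimed anything, so a caller can see that TX work
happened and that more may be left on the ring.

```c
#include <stdbool.h>

#define TX_WEIGHT 64	/* same value as E1000_TX_WEIGHT in the snippet */

/* Hypothetical model of a TX ring: only a count of completed
 * descriptors that are waiting to be reclaimed. */
struct tx_ring_model {
	unsigned int pending;
};

/* Reclaim at most TX_WEIGHT descriptors in one call.
 * Returns true if at least one descriptor was reclaimed. */
static bool clean_tx_model(struct tx_ring_model *ring)
{
	unsigned int count = 0;
	bool cleaned = false;

	while (ring->pending > 0) {
		ring->pending--;
		cleaned = true;
		if (++count == TX_WEIGHT)
			break;	/* avoid endless transmit cleanup */
	}
	return cleaned;
}
```

With 100 pending descriptors the first call reclaims 64 and leaves 36 behind.
If nothing calls the function again, those 36 stay on the ring, which is the
situation the hang message reports.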

Jesse


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-15 Thread David Miller
From: Brandeburg, Jesse [EMAIL PROTECTED]
Date: Tue, 15 Jan 2008 13:53:43 -0800

 The tx code has an early exit that tries to limit the amount of tx
 packets handled in a single poll loop and requires napi or interrupt
 rescheduling based on the return value from e1000_clean_tx_irq.

That explains everything, thanks Jesse.

Ok, here is the patch I'll propose to fix this.  The goal is to make
it as simple as possible without regressing the thing we were trying
to fix.

Something more sophisticated can be done later.

Three of the 5 Intel drivers had the TX breakout logic.  e1000,
e1000e, and ixgbe.  e100 and ixgb did not, so they don't have any
problems we need to fix here.

What the fix does is behave as if the budget was fully consumed if
*_clean_tx_irq() returns true.

The only valid way to return from ->poll() without completing the NAPI
poll is by returning work_done == budget.  That signals to the caller
that the NAPI instance has not been descheduled and therefore the
caller fully owns the NAPI context.

This does mean that for these drivers any time TX work is done, we'll
loop at least one extra time in the ->poll() loop of net_rx_action() but
that is historically what these drivers have caused to happen for
years.
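
The rule above can be sketched in a few lines of plain C. This is only an
illustration under the assumptions stated in the comments: the helper names
poll_result_model and poll_may_complete are made up, and the real logic lives
inline in each driver's poll routine as shown in the patch below.

```c
#include <stdbool.h>

/* Hypothetical helper: compute the value a poll routine reports.
 * If TX reclaim did any work, report the full budget so that the
 * poll is not completed and the caller polls again. */
static int poll_result_model(bool tx_cleaned, int rx_work_done, int budget)
{
	int work_done = rx_work_done;

	if (tx_cleaned)
		work_done = budget;
	return work_done;
}

/* The poll may only be completed (and interrupts re-enabled) when the
 * reported work is below the budget. */
static bool poll_may_complete(int work_done, int budget)
{
	return work_done < budget;
}
```

For example, with a budget of 64 and 3 RX packets processed, the poll
completes when no TX work was done, but stays scheduled when tx_cleaned is
true.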

For 2.6.25 or similar I would suggest investigating courses of action
to bring closure and consistency to this:

1) Determine whether the loop breakout is actually necessary.
   Jesse explained to me that they had seen a case where a
   thread on one cpu feeding the TX ring could keep a thread
   on another cpu constantly running the *_clean_tx_irq() code
   in a loop.

   I find this hard to believe since even the slowest CPU should be
   able to free up TX entries faster than they can be transmitted on
   gigabit links :-)

2) If the investigation in #1 deems the breakout logic is necessary,
   then consistently amongst all the 5 drivers a policy should be
   implemented which is integrated with the NAPI budgeting logic.
   For example, the simplest thing to do is to pass the budget and the
   work_done thing down into *_clean_tx_irq() and break out if it is
   exceeded.

   As a further refinement we can say that TX work is about 1/4 the
   expense of RX work and adjust the budget checking logic to match
   that.
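
One possible shape for suggestion #2, purely as a sketch: the function
name clean_tx_budgeted and the constant TX_PER_BUDGET_UNIT are hypothetical,
and the 4:1 ratio is just the "about 1/4 the expense" guess from above.
TX reclaim is charged against the same work_done counter as RX and stops
once the budget is reached.

```c
#define TX_PER_BUDGET_UNIT 4	/* assumption: 4 TX reclaims cost about 1 RX packet */

/* Hypothetical sketch: reclaim TX descriptors while charging them
 * against the shared budget.  *work_done is counted in RX-packet units.
 * Returns the number of descriptors reclaimed. */
static unsigned int clean_tx_budgeted(unsigned int *pending,
				      int *work_done, int budget)
{
	unsigned int reclaimed = 0;

	while (*pending > 0 && *work_done < budget) {
		(*pending)--;
		reclaimed++;
		/* charge one budget unit for every 4th descriptor */
		if (reclaimed % TX_PER_BUDGET_UNIT == 0)
			(*work_done)++;
	}
	return reclaimed;
}
```

In this sketch a remainder of fewer than four descriptors is not charged at
all, which keeps the arithmetic simple at the cost of slightly undercounting
TX work.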

[NET]: Fix TX timeout regression in Intel drivers.

This fixes a regression added by changeset
53e52c729cc169db82a6105fac7a166e10c2ec36 ([NET]: Make ->poll()
breakout consistent in Intel ethernet drivers.)

As pointed out by Jesse Brandeburg, for three of the drivers edited
above there is breakout logic in the *_clean_tx_irq() code to prevent
running TX reclaim forever.  If this occurs, we have to elide NAPI
poll completion or else those TX events will never be serviced.

Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 13d57b0..0c9a6f7 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -3929,14 +3929,17 @@ e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring[0] is currently being cleaned anyway. */
 	if (spin_trylock(&adapter->tx_queue_lock)) {
-		e1000_clean_tx_irq(adapter,
-				   &adapter->tx_ring[0]);
+		tx_cleaned = e1000_clean_tx_irq(adapter,
+						&adapter->tx_ring[0]);
 		spin_unlock(&adapter->tx_queue_lock);
 	}
 
 	adapter->clean_rx(adapter, &adapter->rx_ring[0],
 	                  &work_done, budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		if (likely(adapter->itr_setting & 3))
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 4a6fc74..2ab3bfb 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1384,7 +1384,7 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -1394,12 +1394,15 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring is currently being cleaned anyway. */
if 

Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-14 Thread David Miller
From: Frans Pop [EMAIL PROTECTED]
Date: Tue, 15 Jan 2008 06:25:10 +0100

 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang

Does this make the problem go away?

(Note this isn't the final correct patch we should apply.  There
 is no reason why this revert back to the older ->poll() logic
 here should have any effect on the TX hang triggering...)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 13d57b0..cada32c 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_work = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -3929,8 +3929,8 @@ e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring[0] is currently being cleaned anyway. */
 	if (spin_trylock(&adapter->tx_queue_lock)) {
-		e1000_clean_tx_irq(adapter,
-				   &adapter->tx_ring[0]);
+		tx_work = e1000_clean_tx_irq(adapter,
+					     &adapter->tx_ring[0]);
 		spin_unlock(&adapter->tx_queue_lock);
 	}
 
@@ -3938,7 +3938,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 	                  &work_done, budget);
 
 	/* If budget not fully consumed, exit the polling mode */
-	if (work_done < budget) {
+	if (!tx_work && (work_done < budget)) {
 		if (likely(adapter->itr_setting & 3))
 			e1000_set_itr(adapter);
 		netif_rx_complete(poll_dev, napi);


[REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-14 Thread Frans Pop
After compiling v2.6.24-rc7-163-g1a1b285 (x86_64) yesterday I suddenly see this
error repeatedly:
kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
kernel:   Tx Queue             <0>
kernel:   TDH                  <a>
kernel:   TDT                  <a>
kernel:   next_to_use          <a>
kernel:   next_to_clean        <ff>
kernel: buffer_info[next_to_clean]
kernel:   time_stamp           <10002738a>
kernel:   next_to_watch        <ff>
kernel:   jiffies              <1000275b4>
kernel:   next_to_watch.status <1>

My previous kernel was v2.6.24-rc7 and with that this error did not occur. I
have also never seen it with earlier kernels.

The values for TX Queue and next_to_watch.status are constant, the
others vary.

My NIC is:
01:00.0 Ethernet controller [0200]: Intel Corporation 82573E Gigabit Ethernet 
Controller (Copper) (rev 03)

01:00.0 0200: 8086:108c (rev 03)
Subsystem: 8086:3096
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- 
MAbort- SERR- PERR-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 1273
Region 0: Memory at 9020 (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at 9010 (32-bit, non-prefetchable) [size=1M]
Region 2: I/O ports at 1000 [size=32]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 
Enable+
Address: fee0300c  Data: 41a9
Capabilities: [e0] Express Endpoint IRQ 0
Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s 512ns, L1 64us
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
Link: Supported Speed 2.5Gb/s, Width x1, ASPM unknown, Port 0
Link: Latency L0s 128ns, L1 64us
Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch-
Link: Speed 2.5Gb/s, Width x1

The system is an Intel D945GCZ main board with
Intel(R) Pentium(R) D CPU 3.20GHz (dual core) processor.

Cheers,
FJP


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-14 Thread Frans Pop
Wow. That's fast! :-)

On Tuesday 15 January 2008, David Miller wrote:
 From: Frans Pop [EMAIL PROTECTED]

  kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang

 Does this make the problem go away?

I'm compiling a kernel with the patch now. Will let you know the result.
May take a while as I don't know how to trigger the bug, so I'll just have 
to let it run for some time.


Re: e1000 Detected Tx Unit Hang

2006-09-16 Thread Paul Aviles
Jesse, today the server froze and was not able to see anything in the logs. 
Nothing at all about any error, just plain froze.  Just in case, this is a 
different unit altogether, still the same model as the units having the Tx 
Unit Hang, but different memory, motherboard and CPU. The only 1 thing that 
is the same is the hard drive a regular IDE...


The only one thing I noticed that is very weird to me at least is that in 
powering off the unit from the crash and rebooting it I saw some lines like 
this in the logs..


Sep 16 11:08:03 www kernel: checking if image is initramfs... it is
Sep 16 07:05:19 www sysctl: kernel.msgmnb = 65536

The odd part is the diff in the time stamps between one entry and the very 
next one in the log. Any ideas what can cause this? Also, any way to get a 
dump or some way to prevent the system from locking without any log entries?


Regards,

Paul

- Original Message - 
From: Jesse Brandeburg [EMAIL PROTECTED]

To: Paul Aviles [EMAIL PROTECTED]
Cc: netdev@vger.kernel.org
Sent: Tuesday, September 05, 2006 12:09 PM
Subject: Re: e1000 Detected Tx Unit Hang



On 9/3/06, Paul Aviles [EMAIL PROTECTED] wrote:

Hey Jesse, thanks for your reply. Here is the stuff on /procs. The weird

no problem,


part is that I have several other identical systems and only one is
affected. Today I moved the hard drive to another similar system and I am
not seeing the problem so I am wondering if is something maybe wrong with
the card eeprom? Is there a way to check that?


I doubt it is an eeprom problem.  you can dump the eeproms with
ethtool -e eth0 from both machines and compare them .  Odd that only
one system is having the problem.  Could it be that the hardware on
that box is having issues?  Are you sure the machines are running the
same bios version with the same settings?  Any overclocking?


 cat /proc/interrupts
   CPU0   CPU1
 16:  70540  0   IO-APIC-level  uhci_hcd:usb4, eth0


this could contribute to your problem, were you able to test without NAPI?

Jesse







Re: e1000 Detected Tx Unit Hang

2006-09-10 Thread Paul Aviles

Jesse, testing without NAPI, will see how it behaves.

Paul Aviles

- Original Message - 
From: Jesse Brandeburg [EMAIL PROTECTED]

To: Paul Aviles [EMAIL PROTECTED]
Cc: netdev@vger.kernel.org
Sent: Tuesday, September 05, 2006 12:09 PM
Subject: Re: e1000 Detected Tx Unit Hang



On 9/3/06, Paul Aviles [EMAIL PROTECTED] wrote:

Hey Jesse, thanks for your reply. Here is the stuff on /procs. The weird

no problem,


part is that I have several other identical systems and only one is
affected. Today I moved the hard drive to another similar system and I am
not seeing the problem so I am wondering if is something maybe wrong with
the card eeprom? Is there a way to check that?


I doubt it is an eeprom problem.  you can dump the eeproms with
ethtool -e eth0 from both machines and compare them .  Odd that only
one system is having the problem.  Could it be that the hardware on
that box is having issues?  Are you sure the machines are running the
same bios version with the same settings?  Any overclocking?


 cat /proc/interrupts
   CPU0   CPU1
 16:  70540  0   IO-APIC-level  uhci_hcd:usb4, eth0


this could contribute to your problem, were you able to test without NAPI?

Jesse







Re: e1000 Detected Tx Unit Hang

2006-09-05 Thread Jesse Brandeburg

On 9/3/06, Paul Aviles [EMAIL PROTECTED] wrote:

Hey Jesse, thanks for your reply. Here is the stuff on /procs. The weird

no problem,


part is that I have several other identical systems and only one is
affected. Today I moved the hard drive to another similar system and I am
not seeing the problem so I am wondering if is something maybe wrong with
the card eeprom? Is there a way to check that?


I doubt it is an eeprom problem.  you can dump the eeproms with
ethtool -e eth0 from both machines and compare them .  Odd that only
one system is having the problem.  Could it be that the hardware on
that box is having issues?  Are you sure the machines are running the
same bios version with the same settings?  Any overclocking?


 cat /proc/interrupts
   CPU0   CPU1
 16:  70540  0   IO-APIC-level  uhci_hcd:usb4, eth0


this could contribute to your problem, were you able to test without NAPI?

Jesse


Re: e1000 Detected Tx Unit Hang

2006-09-05 Thread Paul Aviles
I haven't done the NAPI yet. These are identical systems altogether, maybe 
the CPU is a different stepping at the most, but that is all.
The 16:  70540  0   IO-APIC-level  uhci_hcd:usb4, eth0 is the 
same in every GS12 I have. No overclocking and same BIOS. Tyan released  ver 
1.8 about a month ago and I did the upgrade and same effect. Then I thought 
about upgrading to 2.6.17.11 just to see if the driver will have any issues 
and nothing, same deal. The only way I was able to control it was using a 
dummy 10/100 non-management switch. Then we had no issues.


I will try without NAPI tomorrow 9-6-06 and will report back. My 
understanding on NAPI was that it will drop packets by design on overload. 
Why will that cause a system lock?


Are there any other kernel options you would like to enable to track this 
better and if you need remote access to the system I can accommodate too, 
just let me know what time zone you are to schedule it. Let me know.


Regards,

Paul Aviles



Re: e1000 Detected Tx Unit Hang

2006-09-03 Thread Jesse Brandeburg

On 9/2/06, Paul Aviles [EMAIL PROTECTED] wrote:

I am getting e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang  using
stock 2.6.17.11, 2.6.17.5 or 2.6.17.4 kernels on centos 4.3.

 The server is a Tyan GS12 ( 82541GI/PI and 82547GI) and is connected to a
Netgear GS724T Gig  switch. I can easily reproduce the problem by trying to
do a large ftp transfer to the server. It does not happen if the server is
connected to a dummy 100 Mb switch, only when is connected to the Gig
switch.
I have also tried the options line below disabling tso, tx and rx in the
modprobe.conf without any luck.


Hi Paul, sorry to hear about your problem.  You're getting hangs on
the 82547, right?  Can you send the output of cat /proc/interrupts?
I'm curious whether you are sharing interrupts while running NAPI.

Also, please try the driver without CONFIG_E1000_NAPI enabled in your
kernel .config, and let us know the results.

Someone has posted (what they think is) a theoretical problem with
irq_sem on the 82547 at e1000.sf.net and I haven't had a chance to
figure it out yet.

Jesse



Re: e1000 Detected Tx Unit Hang

2006-09-03 Thread Paul Aviles
Hey Jesse, thanks for your reply. Here is the output from /proc. The weird
part is that I have several other identical systems and only one is
affected. Today I moved the hard drive to another similar system and I am
not seeing the problem there, so I am wondering whether something is wrong
with the card EEPROM. Is there a way to check that?


Regards,

Paul

cat /proc/interrupts
           CPU0       CPU1
  0:    7716253          0    IO-APIC-edge  timer
  3:      11538          0    IO-APIC-edge  serial
  8:          1          0    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 14:      93406          0    IO-APIC-edge  ide0
 16:      70540          0   IO-APIC-level  uhci_hcd:usb4, eth0
 17:          2          0   IO-APIC-level  ehci_hcd:usb1
 18:          0          0   IO-APIC-level  uhci_hcd:usb2, uhci_hcd:usb5
 19:         90          0   IO-APIC-level  uhci_hcd:usb3
NMI:          0          0
LOC:    7715839    7715838
ERR:          0
MIS:          0
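
The question behind this output is whether eth0 shares an interrupt line
with another device. A small sketch of that check, assuming the 2.6-era
column layout shown above (one count per CPU, then the controller type, then
a comma-separated device list); newer kernels add columns, so the slicing
would need adjusting. The function name is hypothetical.

```python
def shared_irqs(text):
    """Return {irq_number: [device, ...]} for IRQs used by 2+ devices."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():
            continue  # skip NMI, LOC, ERR, MIS summary rows
        fields = rest.split()
        # Skip the per-CPU counts and the controller type (e.g. IO-APIC-level)
        devices = [d for d in " ".join(fields[ncpu + 1:]).split(", ") if d]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        for irq, devices in sorted(shared_irqs(f.read()).items()):
            print(f"IRQ {irq} is shared by: {', '.join(devices)}")
```

For the listing above this would report IRQ 16 (uhci_hcd:usb4 and eth0) and
IRQ 18 (uhci_hcd:usb2 and uhci_hcd:usb5).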



e1000 Detected Tx Unit Hang

2006-09-02 Thread Paul Aviles
I am getting e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang  using 
stock 2.6.17.11, 2.6.17.5 or 2.6.17.4 kernels on centos 4.3.


The server is a Tyan GS12 (82541GI/PI and 82547GI) and is connected to a
Netgear GS724T Gig switch. I can easily reproduce the problem by trying to
do a large ftp transfer to the server. It does not happen if the server is
connected to a dumb 100 Mb switch, only when it is connected to the Gig
switch.
I have also tried the options line below, disabling tso, tx and rx, in
modprobe.conf without any luck.


options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0


in /var/log/kernel I get the following...

Sep  1 23:53:01 www kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Sep  1 23:53:01 www kernel:   Tx Queue             0
Sep  1 23:53:01 www kernel:   TDH                  4c4
Sep  1 23:53:01 www kernel:   TDT                  4c9
Sep  1 23:53:01 www kernel:   next_to_use          4c9
Sep  1 23:53:01 www kernel:   next_to_clean        4c4
Sep  1 23:53:01 www kernel: buffer_info[next_to_clean]
Sep  1 23:53:01 www kernel:   time_stamp           9c60
Sep  1 23:53:01 www kernel:   next_to_watch        4c4
Sep  1 23:53:01 www kernel:   jiffies              9d96
Sep  1 23:53:01 www kernel:   next_to_watch.status 0
.
repeats the same as above a few times
.
Sep  1 23:53:10 www kernel: NETDEV WATCHDOG: eth0: transmit timed out
Sep  1 23:53:13 www kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex

then the server locks up, no response from the keyboard at all and must be 
forced down with a power kill. The suggested tips on how to deal with this 
issue are not working so if I can help troubleshoot this let me know.
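
For reading the register dump in the log above, here is a rough sketch. It
assumes TDH and TDT are the hardware head and driver tail of the transmit
ring, so the descriptors between them are still waiting on the NIC, and
that jiffies minus time_stamp is the age of the oldest pending buffer in
timer ticks. The function name is hypothetical, the ring size is taken from
TxDescriptors=4096 in the options line, and jiffies wrap-around is ignored.

```python
def summarize_hang(fields, ring_size):
    """Summarize a Tx hang dump; `fields` maps log names to hex strings."""
    tdh = int(fields["TDH"], 16)
    tdt = int(fields["TDT"], 16)
    age = int(fields["jiffies"], 16) - int(fields["time_stamp"], 16)
    return {
        # Descriptors handed to the hardware but not yet consumed
        "pending_descriptors": (tdt - tdh) % ring_size,
        # Ticks since the oldest pending buffer was queued
        "age_jiffies": age,
        # Head == tail means the hardware has fetched everything queued
        "head_equals_tail": tdh == tdt,
    }


if __name__ == "__main__":
    dump = {"TDH": "4c4", "TDT": "4c9", "time_stamp": "9c60", "jiffies": "9d96"}
    print(summarize_hang(dump, ring_size=4096))
```

With the values from this log the sketch gives 5 pending descriptors that
have been waiting 310 ticks; converting ticks to seconds needs the kernel's
HZ setting, which is not given here.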


Here is my system info,

driver: e1000
version: 7.0.33-k2-NAPI
firmware-version: N/A
bus-info: :02:01.0

lspci -vv output below..

00:00.0 Host bridge: Intel Corporation 82875P/E7210 Memory Controller Hub (rev 02)
   Subsystem: Intel Corporation 82875P/E7210 Memory Controller Hub
   Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
   Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast TAbort- TAbort- MAbort+ SERR- PERR-
   Latency: 0
   Region 0: Memory at 9000 (32-bit, prefetchable) [size=128M]
   Capabilities: [e4] Vendor Specific Information
   Capabilities: [a0] AGP version 3.0
      Status: RQ=32 Iso- ArqSz=2 Cal=0 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3- Rate=x1,x2,x4
      Command: RQ=1 ArqSz=0 Cal=0 SBA- AGP- GART64- 64bit- FW- Rate=none

00:01.0 PCI bridge: Intel Corporation 82875P Processor to AGP Controller (rev 02) (prog-if 00 [Normal decode])
   Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
   Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR-
   Latency: 64
   Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
   Secondary status: 66Mhz+ FastB2B+ ParErr- DEVSEL=medium TAbort- TAbort- MAbort+ SERR- PERR-
   BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- Reset- FastB2B-

00:03.0 PCI bridge: Intel Corporation 82875P/E7210 Processor to PCI to CSA Bridge (rev 02) (prog-if 00 [Normal decode])
   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
   Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR-
   Latency: 32
   Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
   I/O behind bridge: 2000-2fff
   Memory behind bridge: fc10-fc1f
   Secondary status: 66Mhz+ FastB2B+ ParErr- DEVSEL=medium TAbort- TAbort- MAbort- SERR- PERR-
   BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- Reset- FastB2B-

00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02) (prog-if 00 [UHCI])
   Subsystem: Intel Corporation: Unknown device 24c0
   Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
   Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium TAbort- TAbort- MAbort- SERR- PERR-
   Latency: 0
   Interrupt: pin A routed to IRQ 18
   Region 4: I/O ports at 1400 [size=32]

00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02) (prog-if 00 [UHCI])
   Subsystem: Intel Corporation: Unknown device 24c0
   Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
   Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium TAbort- TAbort- MAbort- SERR- PERR-
   Latency: 0
   Interrupt: pin B routed to IRQ 19
   Region 4: I/O ports at 1420 [size=32]

00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev

Re: e1000 Detected Tx Unit Hang

2006-09-01 Thread Auke Kok

Paul Aviles wrote:
I am getting e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang 
using stock 2.6.17.11, 2.6.17.5 or 2.6.17.4 kernels on centos 4.3.


The server is a Tyan GS10 and is connected to a Netgear GS724T Gig 
switch. I can easily reproduce the problem by trying to do a large ftp 
transfer to the server. It does not happen if the server is connected to 
a dummy 100 Mb switch, only when is connected to the Gig switch.
I have also tried the options line below disabling tso, tx and rx in the 
modprobe.conf without any luck.


options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0


in /var/log/kernel I get the following...

Sep  1 23:53:01 www kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Sep  1 23:53:01 www kernel:   Tx Queue             0
Sep  1 23:53:01 www kernel:   TDH                  4c4
Sep  1 23:53:01 www kernel:   TDT                  4c9
Sep  1 23:53:01 www kernel:   next_to_use          4c9
Sep  1 23:53:01 www kernel:   next_to_clean        4c4
Sep  1 23:53:01 www kernel: buffer_info[next_to_clean]
Sep  1 23:53:01 www kernel:   time_stamp           9c60
Sep  1 23:53:01 www kernel:   next_to_watch        4c4
Sep  1 23:53:01 www kernel:   jiffies              9d96
Sep  1 23:53:01 www kernel:   next_to_watch.status 0
.
repeats the same as above a few times
.
Sep  1 23:53:10 www kernel: NETDEV WATCHDOG: eth0: transmit timed out
Sep  1 23:53:13 www kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex


then the server locks up, no response from the keyboard at all and must 
be forced down with a power kill.


Here is my driver info,

driver: e1000
version: 7.0.33-k2-NAPI
firmware-version: N/A
bus-info: :02:01.0

What else could I check?


[adding netdev to cc, this is a NET issue]

This is a known issue and there are several discussions and bugs filed on this. 
 Please read this one where most is documented, and also the netdev


http://sourceforge.net/tracker/index.php?func=detail&aid=1463045&group_id=42302&atid=447449

more links and information available on http://e1000.sf.net/

Your debugging information might be needed and helpful, so please take the 
trouble of digging in the previous bugreports and reporting anything that might 
be relevant there.


The full lockup is certainly not good, but should not necessarily be related to 
the tx hang (or the cause of that). It is likely that interrupt sharing might 
be a problem here; what kind of e1000 nic is this? lspci -vv?


Cheers,

Auke
