Re: RFC/T: Possible fix for bcm43xx periodic work bug

2006-09-11 Thread Erik Mouw
On Fri, Sep 08, 2006 at 08:36:36AM -0500, Larry Finger wrote:
 Erik Mouw wrote:
  Thanks for the information, pulled wireless-2.6 and recompiling kernel.
  If this really fixes the problem, can we try to get it merged before
  2.6.18 closes? I don't know if vanilla 2.6.18-rc6 locks up on other
  hardware as well, but if it does it would be a major regression against
  2.6.17.
 
 I'm trying. At the moment, I'm testing a patch against 2.6.18-rc6. Once it 
 works here, I'll be 
 putting it out for testing.

I suppose that's the patch in [PATCH] fix for system lockups in
2.6.18-rcX caused by bcm43xx?

FWIW: 2.6.18-rc6+wireless-2.6 runs well, my laptop didn't lock up over
the weekend.


Erik

-- 
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC/T: Possible fix for bcm43xx periodic work bug

2006-09-11 Thread Larry Finger

Erik Mouw wrote:

On Fri, Sep 08, 2006 at 08:36:36AM -0500, Larry Finger wrote:

Erik Mouw wrote:

Thanks for the information, pulled wireless-2.6 and recompiling kernel.
If this really fixes the problem, can we try to get it merged before
2.6.18 closes? I don't know if vanilla 2.6.18-rc6 locks up on other
hardware as well, but if it does it would be a major regression against
2.6.17.
I'm trying. At the moment, I'm testing a patch against 2.6.18-rc6. Once it works here, I'll be 
putting it out for testing.


I suppose that's the patch in [PATCH] fix for system lockups in
2.6.18-rcX caused by bcm43xx?


Yes.


FWIW: 2.6.18-rc6+wireless-2.6 runs well, my laptop didn't lock up over
the weekend.


It seems to hurt some hardware much worse than others. My BCM4306 Rev 2 could
not run without failure longer than ~6 hours, and sometimes as short as 10 
minutes,
before the patch. I've now completed the equivalent of 3 weeks without a failure
(I'm running the preemptible periodic work every second instead of every 
minute).

Larry
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC/T: Possible fix for bcm43xx periodic work bug

2006-09-08 Thread Erik Mouw
On Thu, Sep 07, 2006 at 01:17:05PM -0500, Larry Finger wrote:
 I think I have a fix for the bcm43xx bug that leads to NETDEV WATCHDOG tx 
 timeouts and would like it
 to get as much testing as possible as this bug affects V2.6.18-rcX. If the 
 problem is truly
 fixed, I hope to get the fix into mainline before release of the bug into the 
 stable series.

FWIW, I finally tracked down the bug that hangs my laptop to the
bcm43xx driver. At first I got the impression it was the cpufreq code
(which has been flaky in early 2.6.18-rc kernels), but after disabling
that my laptop still crashed. After that I disabled preempt cause I got
a couple of lockdep warnings when I had it enabled. That didn't make a
difference, laptop still hangs after some time (runs a couple of hours
at most). Right now I'm on wired network and my laptop finally doesn't
hang anymore (up for two days).

The hang is very hard to trigger (i.e.: it happens at random, I see no
pattern) and locks up the machine completely. I've tried to capture
kernel messages through serial console, but that doesn't work (lock up
before any messages are printed).

This is with any 2.6.18-rc kernel without additional patches or
proprietary modules, 2.6.17 works ok.

Output from lspci:

:06:00.0 Network controller: Broadcom Corporation BCM4318 [AirForce
One 54g] 802.11g Wireless LAN Controller (rev 02)
Subsystem: Linksys WPC54G-EU version 3 [Wireless-G Notebook Adapter]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- 
MAbort- SERR- PERR-
Latency: 64
Interrupt: pin A routed to IRQ 10
Region 0: Memory at 3600 (32-bit, non-prefetchable) [size=8K]
00: e4 14 18 43 06 00 00 00 02 00 80 02 00 40 00 00
10: 00 00 00 36 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 37 17 48 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00

Is the patch you just posted supposed to fix the kind of problems I
have, or do I have to look elsewhere?


Erik

-- 
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC/T: Possible fix for bcm43xx periodic work bug

2006-09-08 Thread Michael Buesch
On Friday 08 September 2006 11:42, Erik Mouw wrote:
 On Thu, Sep 07, 2006 at 01:17:05PM -0500, Larry Finger wrote:
  I think I have a fix for the bcm43xx bug that leads to NETDEV WATCHDOG tx 
  timeouts and would like it
  to get as much testing as possible as this bug affects V2.6.18-rcX. If the 
  problem is truly
  fixed, I hope to get the fix into mainline before release of the bug into 
  the stable series.
 
 FWIW, I finally tracked down the bug that hangs my laptop to the
 bcm43xx driver. At first I got the impression it was the cpufreq code
 (which has been flaky in early 2.6.18-rc kernels), but after disabling
 that my laptop still crashed. After that I disabled preempt cause I got
 a couple of lockdep warnings when I had it enabled. That didn't make a
 difference, laptop still hangs after some time (runs a couple of hours
 at most). Right now I'm on wired network and my laptop finally doesn't
 hang anymore (up for two days).
 
 The hang is very hard to trigger (i.e.: it happens at random, I see no
 pattern) and locks up the machine completely. I've tried to capture
 kernel messages through serial console, but that doesn't work (lock up
 before any messages are printed).
 
 This is with any 2.6.18-rc kernel without additional patches or
 proprietary modules, 2.6.17 works ok.

The crash is fixed in wireless-2.6.
The actual cause of the controller restart not. So at least it
does not crash anymore.

-- 
Greetings Michael.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC/T: Possible fix for bcm43xx periodic work bug

2006-09-08 Thread Michael Buesch
On Friday 08 September 2006 15:25, Larry Finger wrote:
 Michael Buesch wrote:
  On Thursday 07 September 2006 20:17, Larry Finger wrote:
  Hi all,
 
  I think I have a fix for the bcm43xx bug that leads to NETDEV WATCHDOG tx 
  timeouts and would like it
  to get as much testing as possible as this bug affects V2.6.18-rcX. If the 
  problem is truly
  fixed, I hope to get the fix into mainline before release of the bug into 
  the stable series.
 
  I got the idea for the fix when I discovered that the timeout interval 
  used for the watchdog is the 
  default value of 5 seconds. Obviously, the few milliseconds used in the 
  periodic work handler 
  weren't causing us to just miss.
 
  To exacerbate the problem, I changed the repeat timer for periodic work 
  from 15 to 1 sec. I also set 
  BADNESS_LIMIT to 0. As a result, I was running the problem code once per 
  second instead of once per 
  minute. Now failures would occur in minutes instead of hours.
 
  Operating from the premise that the DMA needed some time to reach the idle 
  state after the MAC was 
  suspended, I tried various delays, but nothing worked.
 
  Then I decided to test the premise that the problem was associated with 
  shutting down and restarting 
  the network. That lead to the current patch, which has run for what is 
  effectively 100 times longer 
  than previous versions.
 
  Larry
  ---
 
 
  Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
  ===
  --- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
  +++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
  @@ -3244,8 +3244,6 @@ static void bcm43xx_periodic_work_handle
  * be preemtible.
  */
 mutex_lock(bcm-mutex);
  -  netif_stop_queue(bcm-net_dev);
  -  synchronize_net();
 spin_lock_irqsave(bcm-irq_lock, flags);
 bcm43xx_mac_suspend(bcm);
 if (bcm43xx_using_pio(bcm))
  @@ -3270,7 +3268,6 @@ static void bcm43xx_periodic_work_handle
 if (bcm43xx_using_pio(bcm))
 bcm43xx_pio_thaw_txqueues(bcm);
 bcm43xx_mac_enable(bcm);
  -  netif_wake_queue(bcm-net_dev);
 }
 mmiowb();
 spin_unlock_irqrestore(bcm-irq_lock, flags);
  
  The real question is: Why does this patch help?
  
  Let's explain it. We don't stop networking just for fun there.
  While executing long preemptible periodic work, we must ensure
  that the TX path into the driver is not entered. It's the same
  reason why we disable IRQs in the first place. We can't take the
  mutex in the TX path and the IRQ handler. (That are the only places
  where we can't take the mutex).
  Short: We must stop netif here.
  The question is: Why does stopping netif queue cause a watchdog
  trigger here? The maximum time it can take for the periodic
  work inside of the critical section is about 0.2sec. So the queue
  is stopped for about 0.2sec max. Why does the watchdog trigger?
  Any idea from some networking guru?
  Could synchronize_net() take over 5sec in some worst case? Why?
  Questions over questions :D
 
 This may be a stupid question, but does the synchronize_net call belong?
 
 The reason I ask is because the code for synchronize_net is

synchronize_net() ensures that all currently running TX handlers
complete before returning from synchronize_net(). That's what I have
been told.
So what we do is: Disable TX queue and wait for any running TX queue
to finish. We must do this to make sure no TX handler can run after
the sync. I previously explained the exact reasons.

 When I look through the mutex_lock code, I don't find any rcu code. What did 
 I miss?

This is bot about synchronoizing bcm43xx, but the net layer.

 BTW, I still  
 got NETDEV tx timeouts with a 30 second timeout.

So, well. I don't think synchronize_net can take 30 seconds.
That would be a big bug in the net layer.

 I'm currently testing with the netif_stop_queue and netif_wake_queue 
 statements restored, but 
 without the synchronize_net. It has run for over 11 hours with the 
 accelerated testing, which would 
 correspond to almost 4 weeks at regular rates.

Well, we can brute-force it to death, or we can ask someone who actually has
a clue what is going on. Some networking guru, please help. :)

-- 
Greetings Michael.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC/T: Possible fix for bcm43xx periodic work bug

2006-09-08 Thread Erik Mouw
On Fri, Sep 08, 2006 at 11:45:27AM +0200, Michael Buesch wrote:
 The crash is fixed in wireless-2.6.
 The actual cause of the controller restart not. So at least it
 does not crash anymore.

Thanks for the information, pulled wireless-2.6 and recompiling kernel.
If this really fixes the problem, can we try to get it merged before
2.6.18 closes? I don't know if vanilla 2.6.18-rc6 locks up on other
hardware as well, but if it does it would be a major regression against
2.6.17.


Erik

-- 
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC/T: Possible fix for bcm43xx periodic work bug

2006-09-08 Thread Larry Finger

Erik Mouw wrote:

On Fri, Sep 08, 2006 at 11:45:27AM +0200, Michael Buesch wrote:

The crash is fixed in wireless-2.6.
The actual cause of the controller restart not. So at least it
does not crash anymore.


Thanks for the information, pulled wireless-2.6 and recompiling kernel.
If this really fixes the problem, can we try to get it merged before
2.6.18 closes? I don't know if vanilla 2.6.18-rc6 locks up on other
hardware as well, but if it does it would be a major regression against
2.6.17.


I'm trying. At the moment, I'm testing a patch against 2.6.18-rc6. Once it works here, I'll be 
putting it out for testing.


Larry
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC/T: Possible fix for bcm43xx periodic work bug

2006-09-07 Thread Michael Buesch
On Thursday 07 September 2006 20:17, Larry Finger wrote:
 Hi all,
 
 I think I have a fix for the bcm43xx bug that leads to NETDEV WATCHDOG tx 
 timeouts and would like it
 to get as much testing as possible as this bug affects V2.6.18-rcX. If the 
 problem is truly
 fixed, I hope to get the fix into mainline before release of the bug into the 
 stable series.
 
 I got the idea for the fix when I discovered that the timeout interval used 
 for the watchdog is the 
 default value of 5 seconds. Obviously, the few milliseconds used in the 
 periodic work handler 
 weren't causing us to just miss.
 
 To exacerbate the problem, I changed the repeat timer for periodic work from 
 15 to 1 sec. I also set 
 BADNESS_LIMIT to 0. As a result, I was running the problem code once per 
 second instead of once per 
 minute. Now failures would occur in minutes instead of hours.
 
 Operating from the premise that the DMA needed some time to reach the idle 
 state after the MAC was 
 suspended, I tried various delays, but nothing worked.
 
 Then I decided to test the premise that the problem was associated with 
 shutting down and restarting 
 the network. That lead to the current patch, which has run for what is 
 effectively 100 times longer 
 than previous versions.
 
 Larry
 ---
 
 
 Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
 ===
 --- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
 +++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
 @@ -3244,8 +3244,6 @@ static void bcm43xx_periodic_work_handle
* be preemtible.
*/
   mutex_lock(bcm-mutex);
 - netif_stop_queue(bcm-net_dev);
 - synchronize_net();
   spin_lock_irqsave(bcm-irq_lock, flags);
   bcm43xx_mac_suspend(bcm);
   if (bcm43xx_using_pio(bcm))
 @@ -3270,7 +3268,6 @@ static void bcm43xx_periodic_work_handle
   if (bcm43xx_using_pio(bcm))
   bcm43xx_pio_thaw_txqueues(bcm);
   bcm43xx_mac_enable(bcm);
 - netif_wake_queue(bcm-net_dev);
   }
   mmiowb();
   spin_unlock_irqrestore(bcm-irq_lock, flags);

The real question is: Why does this patch help?

Let's explain it. We don't stop networking just for fun there.
While executing long preemptible periodic work, we must ensure
that the TX path into the driver is not entered. It's the same
reason why we disable IRQs in the first place. We can't take the
mutex in the TX path and the IRQ handler. (That are the only places
where we can't take the mutex).
Short: We must stop netif here.
The question is: Why does stopping netif queue cause a watchdog
trigger here? The maximum time it can take for the periodic
work inside of the critical section is about 0.2sec. So the queue
is stopped for about 0.2sec max. Why does the watchdog trigger?
Any idea from some networking guru?
Could synchronize_net() take over 5sec in some worst case? Why?
Questions over questions :D

-- 
Greetings Michael.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC/T: Possible fix for bcm43xx periodic work bug

2006-09-07 Thread Larry Finger

Michael Buesch wrote:


The real question is: Why does this patch help?

Let's explain it. We don't stop networking just for fun there.
While executing long preemptible periodic work, we must ensure
that the TX path into the driver is not entered. It's the same
reason why we disable IRQs in the first place. We can't take the
mutex in the TX path and the IRQ handler. (That are the only places
where we can't take the mutex).
Short: We must stop netif here.
The question is: Why does stopping netif queue cause a watchdog
trigger here? The maximum time it can take for the periodic
work inside of the critical section is about 0.2sec. So the queue
is stopped for about 0.2sec max. Why does the watchdog trigger?
Any idea from some networking guru?
Could synchronize_net() take over 5sec in some worst case? Why?
Questions over questions :D



To check if it takes more than 5 seconds, I restored the original network disabling code and 
increased the timeout to 30 seconds. If this works without error, I'll try to margin the time. I'm 
still running that branch every second.


Larry


Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
===
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
@@ -4147,6 +4147,7 @@ static int __devinit bcm43xx_init_one(st
SET_MODULE_OWNER(net_dev);
SET_NETDEV_DEV(net_dev, pdev-dev);

+   net_dev-watchdog_timeo = 30 * HZ;
net_dev-open = bcm43xx_net_open;
net_dev-stop = bcm43xx_net_stop;
net_dev-get_stats = bcm43xx_net_get_stats;



Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
===
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
@@ -3207,7 +3207,7 @@ static void do_periodic_work(struct bcm4
bcm43xx_periodic_every15sec(bcm);
bcm-periodic_state = state + 1;
 
-   schedule_delayed_work(bcm-periodic_work, HZ * 15);
+   schedule_delayed_work(bcm-periodic_work, HZ * 1);
 }
 
 /* Estimate a Badness value based on the periodic work
@@ -3227,7 +3227,7 @@ static int estimate_periodic_work_badnes
if (state % 1 == 0) /* every 15 sec */
badness += 1;
 
-#define BADNESS_LIMIT  4
+#define BADNESS_LIMIT  0
return badness;
 }
 
@@ -4147,6 +4147,7 @@ static int __devinit bcm43xx_init_one(st
SET_MODULE_OWNER(net_dev);
SET_NETDEV_DEV(net_dev, pdev-dev);
 
+   net_dev-watchdog_timeo = 30 * HZ;
net_dev-open = bcm43xx_net_open;
net_dev-stop = bcm43xx_net_stop;
net_dev-get_stats = bcm43xx_net_get_stats;