Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-20 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > On Thursday 19 July 2007 21:56, Ingo Molnar wrote: > > nope - with this patch applied the box still has no network, symptoms > > are similar. (should i apply the WARN_ON() patch too?) > > Yes, that would be nice. If that doesn't help, you can also

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Olaf Kirch
On Thursday 19 July 2007 21:56, Ingo Molnar wrote: > nope - with this patch applied the box still has no network, symptoms > are similar. (should i apply the WARN_ON() patch too?) Yes, that would be nice. If that doesn't help, you can also throw in the one below. Olaf

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > Does the following help? > --- build-2.6.orig/drivers/net/netconsole.c > +++ build-2.6/drivers/net/netconsole.c > @@ -70,7 +70,7 @@ static void write_msg(struct console *co > int frag, left; > unsigned long flags; > > - if (!np.dev) >

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Olaf Kirch
Does the following help? Olaf Test patch --- Index: build-2.6/drivers/net/netconsole.c

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > Here's a somewhat drastic modification that should not change any > timing, but just verifies whether my patch is to blame at all. Can you > give it a try? > @@ -1027,7 +1027,7 @@ static inline void netif_rx_complete(str >* But at least it

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Olaf Kirch
On Thursday 19 July 2007 18:07, Ingo Molnar wrote: > because i dont seem to be able to trigger Olaf's WARN_ON(), can you see > anything in the ethtool output that i sent in the previous mail(s)? If the WARN_ON doesn't trigger, I cannot see how my patch would affect your system. - IF we

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Olaf Kirch
On Thursday 19 July 2007 19:36, Olaf Kirch wrote: > Can you confirm this by spraying the laptop with arp packets > or broadcast pings while it's booting? Sorry for the noise - didn't see your other message where you described just that. This sounds more like a hardware issue - Rx interrupt seems

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > On Thursday 19 July 2007 18:05, Ingo Molnar wrote: > > that network-intense test also produced periodic broadcast packets that > > got the e1000 out of its weird state before the tx timeout could hit. > > Now that i've stopped the test, the network is

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Olaf Kirch
On Thursday 19 July 2007 18:05, Ingo Molnar wrote: > that network-intense test also produced periodic broadcast packets that > got the e1000 out of its weird state before the tx timeout could hit. > Now that i've stopped the test, the network is quiescent again and the > e1000 hangs. Can you

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
* Ingo Molnar <[EMAIL PROTECTED]> wrote: > > i'll now check whether removing ignore_on_loglevel (no other > > changes) makes the hang go away. Maybe ignore_on_loglevel is buggy - > > or it produces an immediate printk (going out to the interface) > > during a particularly sensitive period of

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
* Kok, Auke <[EMAIL PROTECTED]> wrote: > > I don't have a fix ready yet - I hope I'll have something later this > > afternoon. > > interesting, you seem to found the cause allright. I can't confirm the > problem but I know that netpoll and NAPI has historically been an > issue. I look

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
* Ingo Molnar <[EMAIL PROTECTED]> wrote: > i'll now check whether removing ignore_on_loglevel (no other changes) > makes the hang go away. Maybe ignore_on_loglevel is buggy - or it > produces an immediate printk (going out to the interface) during a > particularly sensitive period of network

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
* Ingo Molnar <[EMAIL PROTECTED]> wrote: > ah! Just found the reason: the bug apparently depends on the precise > kernel command-line contents. I accidentally dropped ignore_loglevel > (found this while comparing with the older logs i sent to you), adding > it back in produces hung networking

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Kok, Auke
Olaf Kirch wrote: On Thursday 19 July 2007 12:58, Ingo Molnar wrote: i.e. it's the classic 'eth0 got stuck somehow' tx/rx state machine hickup symptoms, with no other bad symptoms such as lockups or crashes. Duh, I found it. The e1000 poll routine does this to leave polling mode.

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
* Ingo Molnar <[EMAIL PROTECTED]> wrote: > ugh. Something really weird happened with this e1000 problem. > > i crashed the laptop in a weird way and had to power-cycle it in an > unusual fashion. After that i wanted to try your latest BUG_ON() > theory but the network hang went away! > > For

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Olaf Kirch
On Thursday 19 July 2007 17:07, Ingo Molnar wrote: > i crashed the laptop in a weird way and had to power-cycle it in an > unusual fashion. After that i wanted to try your latest BUG_ON() theory > but the network hang went away! Should I rejoice, or regret? :-) > maybe it's not the

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
ugh. Something really weird happened with this e1000 problem. i crashed the laptop in a weird way and had to power-cycle it in an unusual fashion. After that i wanted to try your latest BUG_ON() theory but the network hang went away! For 3 hours i tried to reproduce the hang (i went back to

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Olaf Kirch
On Thursday 19 July 2007 14:52, Olaf Kirch wrote: > On Thursday 19 July 2007 12:58, Ingo Molnar wrote: > > i.e. it's the classic 'eth0 got stuck somehow' tx/rx state machine > > hickup symptoms, with no other bad symptoms such as lockups or crashes. > > Duh, I found it. The following patch

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Olaf Kirch
On Thursday 19 July 2007 12:58, Ingo Molnar wrote: > i.e. it's the classic 'eth0 got stuck somehow' tx/rx state machine > hickup symptoms, with no other bad symptoms such as lockups or crashes. Duh, I found it. The e1000 poll routine does this to leave polling mode.

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
* Ingo Molnar <[EMAIL PROTECTED]> wrote: > * Olaf Kirch <[EMAIL PROTECTED]> wrote: > > > On Thursday 19 July 2007 12:01, Ingo Molnar wrote: > > > Calling initcall 0xc0603f55: netpoll_init+0x0/0x39() > > > initcall 0xc0603f55: netpoll_init+0x0/0x39() returned 0. > > > initcall 0xc0603f55 ran

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > On Thursday 19 July 2007 12:01, Ingo Molnar wrote: > > Calling initcall 0xc0603f55: netpoll_init+0x0/0x39() > > initcall 0xc0603f55: netpoll_init+0x0/0x39() returned 0. > > initcall 0xc0603f55 ran for 0 msecs: netpoll_init+0x0/0x39() > > Calling

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Olaf Kirch
On Thursday 19 July 2007 12:01, Ingo Molnar wrote: > Calling initcall 0xc0603f55: netpoll_init+0x0/0x39() > initcall 0xc0603f55: netpoll_init+0x0/0x39() returned 0. > initcall 0xc0603f55 ran for 0 msecs: netpoll_init+0x0/0x39() > Calling initcall 0xc0604257: netlink_proto_init+0x0/0x12a() >

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
* Ingo Molnar <[EMAIL PROTECTED]> wrote: > the e1000 in this laptop is historically pretty robust. The only > problem i ever had with it were some rx/tx hw-engine latency problems > [pings from the outside took up to 1 second to propagate] that were > quickly fixed by the e1000 driver guys.

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > -You say that netconsole output continues to trickle after > the network gets wedged. This could be caused by the > e1000 watchdog, which triggers a NIC interrupt "to ensure > rx ring is cleaned". I assume that this triggers the >

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Olaf Kirch
On Thursday 19 July 2007 11:09, Ingo Molnar wrote: > the e1000 in this laptop is historically pretty robust. The only problem > i ever had with it were some rx/tx hw-engine latency problems [pings > from the outside took up to 1 second to propagate] that were quickly > fixed by the e1000 driver

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Ingo Molnar
i have your original patch applied to my working tree to be able to observe this bug's behavior, and here's another observation: the problem seems to go away if i turn on CONFIG_NO_HZ. So it looks timing related indeed ... but when the bug happens, it happens all the time, reboot after

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-19 Thread Jarek Poplawski
On Wed, Jul 18, 2007 at 01:48:20PM +0200, Jarek Poplawski wrote: ... > I'd be very glad if it could be verified and/or tested. Jarek, This patch is verified crap! Regards, Jarek P. PS: Olaf, You've written earlier that one of the main reasons for poll_napi is to work when the kernel "doesn't

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-18 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > > also, i'm using netconsole via the command line (both the network > > driver and netconsole is built into the bzImage), maybe that makes a > > difference? > > Possibly - but so far there's nothing in the code that jumped at me. > > Can you try the

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-18 Thread Olaf Kirch
On Wednesday 18 July 2007 14:48, Ingo Molnar wrote: > something i noticed: netconsole output seems to trickle through though, > but very, very slowly (a packet once every 4 seconds or so). TCP/IP is > not functional. > > also, i'm using netconsole via the command line (both the network driver

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-18 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > On Tuesday 17 July 2007 20:56, Ingo Molnar wrote: > > i logged these not via netconsole but via logging on over the console > > and using dmesg, so it should include everything. in the 100hz case the > > following seems to show the anomaly: > > > >

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-18 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > On Tuesday 17 July 2007 20:56, Ingo Molnar wrote: > > i logged these not via netconsole but via logging on over the console > > and using dmesg, so it should include everything. in the 100hz case the > > following seems to show the anomaly: > > > >

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-18 Thread Olaf Kirch
On Tuesday 17 July 2007 20:56, Ingo Molnar wrote: > i logged these not via netconsole but via logging on over the console > and using dmesg, so it should include everything. in the 100hz case the > following seems to show the anomaly: > > NETDEV WATCHDOG: eth0: transmit timed out So, it

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-18 Thread Jarek Poplawski
Hi, Here is my proposal of a solution based on dev->state flag, but intended mainly to prevent poll_napi from disturbing while net_rx_action is running and polling the device. It doesn't look very nice or clean but I hope it could guard net_rx_action enough with some room for netpoll too. I'd

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > On Tuesday 17 July 2007 20:18, Ingo Molnar wrote: > > (one is HZ=100, the other HZ=1000. HZ=100 produces a hung network just > > like HZ=250.) > > > > no 'rx_sched set' messages in either case. Network still hung for > > HZ=100, and is working for

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Olaf Kirch
On Tuesday 17 July 2007 20:18, Ingo Molnar wrote: > (one is HZ=100, the other HZ=1000. HZ=100 produces a hung network just > like HZ=250.) > > no 'rx_sched set' messages in either case. Network still hung for > HZ=100, and is working for HZ=1000. Is this from dmesg or the netconsole output? I

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > Hi Ingo, > > On Tuesday 17 July 2007 18:57, Ingo Molnar wrote: > > i've done the patch below, but it did not change the timeouts nor did it > > solve the 'no network' problem. netconsole output hung earlier as well. > Hm, pity. > > To rule out any

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Ingo Molnar
* David Miller <[EMAIL PROTECTED]> wrote: > From: Ingo Molnar <[EMAIL PROTECTED]> > Date: Tue, 17 Jul 2007 00:37:18 +0200 > > > I think if you leaned back and thought it through, and if you > > applied this scenario to a bad scheduler commit from me that broke > > your box, you'd readily

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Olaf Kirch
Hi Ingo, On Tuesday 17 July 2007 18:57, Ingo Molnar wrote: > i've done the patch below, but it did not change the timeouts nor did it > solve the 'no network' problem. netconsole output hung earlier as well. Hm, pity. To rule out any e1000 problem, can you try the following please, both

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Linus Torvalds
On Tue, 17 Jul 2007, Ingo Molnar wrote: > > i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes > the problem go away. So it's somehow also related to jiffies. No, I suspect it's just related to timing: you need to hit that window when the LIST_FROZEN bit is set, and since

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > Can you try what happens if you change netif_rx_complete to something > like this: > > if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)) { > dev->quota = dev->weight; > return; > } > > This is just a hack to

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Olaf Kirch
On Tuesday 17 July 2007 10:57, Ingo Molnar wrote: > i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes > the problem go away. So it's somehow also related to jiffies. There are several "Tx Hang detected" messages in the log, which looks a lot as if net_rx_action never runs, or

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Jarek Poplawski
On Tue, Jul 17, 2007 at 10:57:48AM +0200, Ingo Molnar wrote: > > Olaf, > > i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes > the problem go away. So it's somehow also related to jiffies. IMHO it could be related with __LINK_STATE_RX_SCHED beeing set too long e.g. between

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Jarek Poplawski
On Tue, Jul 17, 2007 at 10:28:34AM +0200, Olaf Kirch wrote: > On Tuesday 17 July 2007 09:55, Olaf Kirch wrote: > > What I find more problematic about this portion of code though > > is that once a net_device is over quota, net_rx_action will > > loop for up to one jiffy, even if there's just this

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Ingo Molnar
Olaf, i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes the problem go away. So it's somehow also related to jiffies. Ingo

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Olaf Kirch
On Tuesday 17 July 2007 09:55, Olaf Kirch wrote: > What I find more problematic about this portion of code though > is that once a net_device is over quota, net_rx_action will > loop for up to one jiffy, even if there's just this one device on > the poll_list. Duh, wrong. For every loop, it'll

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Olaf Kirch
On Monday 16 July 2007 23:40, Linus Torvalds wrote: > - The change seems to always set the LIST_FROZEN bit when calling >->poll(), and at least on e1000, the NAPI poll() routine ends up doing >that netif_rx_complete(), so we're *guaranteed* to always take the >early exit and not do

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Olaf Kirch
On Tuesday 17 July 2007 08:14, Jarek Poplawski wrote: > > If after poll_napi dev->quota <= 0 dev->poll is not run and > > __LINK_STATE_RX_SCHED bit (plus dev->poll_list) stays uncleared. > > Or, more precisely dev->poll_list will be cleared just after this, > and net_rx_action returns with

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Olaf Kirch
On Tuesday 17 July 2007 00:08, David Miller wrote: > Sure, but I thought it would be nice to give Olaf a day or two to > figure out what's going on rather than have the knee-jerk reaction to > just revert. Oh, reverting is fine with me. I'll just resubmit the patch. Olaf

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Jarek Poplawski
On Tue, Jul 17, 2007 at 07:46:39AM +0200, Jarek Poplawski wrote: ... > > static void net_rx_action(struct softirq_action *h) > > { > > struct softnet_data *queue = &__get_cpu_var(softnet_data); > > unsigned long start_time = jiffies; > > int budget = netdev_budget; > >

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Olaf Kirch
On Tuesday 17 July 2007 09:55, Olaf Kirch wrote: > What I find more problematic about this portion of code though is that once a net_device is over quota, net_rx_action will loop for up to one jiffy, even if there's just this one device on the poll_list. Duh, wrong. For every loop, it'll add

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Ingo Molnar
Olaf, i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes the problem go away. So it's somehow also related to jiffies. Ingo

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Jarek Poplawski
On Tue, Jul 17, 2007 at 10:28:34AM +0200, Olaf Kirch wrote: > On Tuesday 17 July 2007 09:55, Olaf Kirch wrote: > > What I find more problematic about this portion of code though is that once a net_device is over quota, net_rx_action will loop for up to one jiffy, even if there's just this one

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Jarek Poplawski
On Tue, Jul 17, 2007 at 10:57:48AM +0200, Ingo Molnar wrote: > Olaf, i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes the problem go away. So it's somehow also related to jiffies. IMHO it could be related to __LINK_STATE_RX_SCHED being set too long e.g. between two

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Olaf Kirch
On Tuesday 17 July 2007 10:57, Ingo Molnar wrote: > i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes the problem go away. So it's somehow also related to jiffies. There are several "Tx Hang detected" messages in the log, which looks a lot as if net_rx_action never runs, or at

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > Can you try what happens if you change netif_rx_complete to something like this: > > if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)) { > dev->quota = dev->weight; > return; > } > > This is just a hack to make sure

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Linus Torvalds
On Tue, 17 Jul 2007, Ingo Molnar wrote: > i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes the problem go away. So it's somehow also related to jiffies. No, I suspect it's just related to timing: you need to hit that window when the LIST_FROZEN bit is set, and since it

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Olaf Kirch
Hi Ingo, On Tuesday 17 July 2007 18:57, Ingo Molnar wrote: > i've done the patch below, but it did not change the timeouts nor did it solve the 'no network' problem. netconsole output hung earlier as well. Hm, pity. To rule out any e1000 problem, can you try the following please, both with

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Ingo Molnar
* David Miller <[EMAIL PROTECTED]> wrote: > From: Ingo Molnar <[EMAIL PROTECTED]> > Date: Tue, 17 Jul 2007 00:37:18 +0200 > > > I think if you leaned back and thought it through, and if you applied this scenario to a bad scheduler commit from me that broke your box, you'd readily agree with me

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > Hi Ingo, > > On Tuesday 17 July 2007 18:57, Ingo Molnar wrote: > > i've done the patch below, but it did not change the timeouts nor did it solve the 'no network' problem. netconsole output hung earlier as well. > > Hm, pity. To rule out any e1000 problem,

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Olaf Kirch
On Tuesday 17 July 2007 20:18, Ingo Molnar wrote: > (one is HZ=100, the other HZ=1000. HZ=100 produces a hung network just like HZ=250.) no 'rx_sched set' messages in either case. Network still hung for HZ=100, and is working for HZ=1000. Is this from dmesg or the netconsole output? I don't

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-17 Thread Ingo Molnar
* Olaf Kirch <[EMAIL PROTECTED]> wrote: > On Tuesday 17 July 2007 20:18, Ingo Molnar wrote: > > (one is HZ=100, the other HZ=1000. HZ=100 produces a hung network just like HZ=250.) no 'rx_sched set' messages in either case. Network still hung for HZ=100, and is working for HZ=1000. > > Is

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-16 Thread Jarek Poplawski
On 16-07-2007 11:12, Ingo Molnar wrote: > current -git broke my main testbox. No TCP/IP networking to/from the box > and e1000 would time out in xmit: > > NETDEV WATCHDOG: eth0: transmit timed out > e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang ... Olaf, I think this error can

Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll

2007-07-16 Thread Linus Torvalds
On Mon, 16 Jul 2007, Matt Mackall wrote: > > Unfortunately the particular patch from Olaf is presumably covering up > another bug that other people (including Olaf) had hit. So reverting > it is going to introduce a different regression. It's not a regression, it's an old problem. And the
