* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> On Thursday 19 July 2007 21:56, Ingo Molnar wrote:
> > nope - with this patch applied the box still has no network, symptoms
> > are similar. (should i apply the WARN_ON() patch too?)
>
> Yes, that would be nice. If that doesn't help, you can also
On Thursday 19 July 2007 21:56, Ingo Molnar wrote:
> nope - with this patch applied the box still has no network, symptoms
> are similar. (should i apply the WARN_ON() patch too?)
Yes, that would be nice. If that doesn't help, you can also throw in
the one below.
Olaf
--
Olaf Kirch | --- o
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> Does the following help?
> --- build-2.6.orig/drivers/net/netconsole.c
> +++ build-2.6/drivers/net/netconsole.c
> @@ -70,7 +70,7 @@ static void write_msg(struct console *co
>       int frag, left;
>       unsigned long flags;
>
> -     if (!np.dev)
>
Does the following help?
Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[EMAIL PROTECTED] |/ | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax
Test patch
---
Index: build-2.6/drivers/net/netconsole.c
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> Here's a somewhat drastic modification that should not change any
> timing, but just verifies whether my patch is to blame at all. Can you
> give it a try?
> @@ -1027,7 +1027,7 @@ static inline void netif_rx_complete(str
>  * But at least it
On Thursday 19 July 2007 18:07, Ingo Molnar wrote:
> because i dont seem to be able to trigger Olaf's WARN_ON(), can you see
> anything in the ethtool output that i sent in the previous mail(s)?
If the WARN_ON doesn't trigger, I cannot see how my patch would affect
your system.
- IF we
On Thursday 19 July 2007 19:36, Olaf Kirch wrote:
> Can you confirm this by spraying the laptop with arp packets
> or broadcast pings while it's booting?
Sorry for the noise - didn't see your other message where you
described just that.
This sounds more like a hardware issue - Rx interrupt seems
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> On Thursday 19 July 2007 18:05, Ingo Molnar wrote:
> > that network-intense test also produced periodic broadcast packets that
> > got the e1000 out of its weird state before the tx timeout could hit.
> > Now that i've stopped the test, the network is
On Thursday 19 July 2007 18:05, Ingo Molnar wrote:
> that network-intense test also produced periodic broadcast packets that
> got the e1000 out of its weird state before the tx timeout could hit.
> Now that i've stopped the test, the network is quiescent again and the
> e1000 hangs.
Can you
* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> > i'll now check whether removing ignore_on_loglevel (no other
> > changes) makes the hang go away. Maybe ignore_on_loglevel is buggy -
> > or it produces an immediate printk (going out to the interface)
> > during a particularly sensitive period of
* Kok, Auke <[EMAIL PROTECTED]> wrote:
> > I don't have a fix ready yet - I hope I'll have something later this
> > afternoon.
>
> interesting, you seem to have found the cause all right. I can't confirm the
> problem but I know that netpoll and NAPI have historically been an
> issue. I look
* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> i'll now check whether removing ignore_on_loglevel (no other changes)
> makes the hang go away. Maybe ignore_on_loglevel is buggy - or it
> produces an immediate printk (going out to the interface) during a
> particularly sensitive period of network
* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> ah! Just found the reason: the bug apparently depends on the precise
> kernel command-line contents. I accidentally dropped ignore_loglevel
> (found this while comparing with the older logs i sent to you), adding
> it back in produces hung networking
Olaf Kirch wrote:
On Thursday 19 July 2007 12:58, Ingo Molnar wrote:
i.e. it's the classic 'eth0 got stuck somehow' tx/rx state machine
hiccup symptoms, with no other bad symptoms such as lockups or crashes.
Duh, I found it.
The e1000 poll routine does this to leave polling mode.
* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> ugh. Something really weird happened with this e1000 problem.
>
> i crashed the laptop in a weird way and had to power-cycle it in an
> unusual fashion. After that i wanted to try your latest BUG_ON()
> theory but the network hang went away!
>
> For
On Thursday 19 July 2007 17:07, Ingo Molnar wrote:
> i crashed the laptop in a weird way and had to power-cycle it in an
> unusual fashion. After that i wanted to try your latest BUG_ON() theory
> but the network hang went away!
Should I rejoice, or regret? :-)
> maybe it's not the
ugh. Something really weird happened with this e1000 problem.
i crashed the laptop in a weird way and had to power-cycle it in an
unusual fashion. After that i wanted to try your latest BUG_ON() theory
but the network hang went away!
For 3 hours i tried to reproduce the hang (i went back to
On Thursday 19 July 2007 14:52, Olaf Kirch wrote:
> On Thursday 19 July 2007 12:58, Ingo Molnar wrote:
> > i.e. it's the classic 'eth0 got stuck somehow' tx/rx state machine
> > hiccup symptoms, with no other bad symptoms such as lockups or crashes.
>
> Duh, I found it.
The following patch
On Thursday 19 July 2007 12:58, Ingo Molnar wrote:
> i.e. it's the classic 'eth0 got stuck somehow' tx/rx state machine
> hiccup symptoms, with no other bad symptoms such as lockups or crashes.
Duh, I found it.
The e1000 poll routine does this to leave polling mode.
* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> * Olaf Kirch <[EMAIL PROTECTED]> wrote:
>
> > On Thursday 19 July 2007 12:01, Ingo Molnar wrote:
> > > Calling initcall 0xc0603f55: netpoll_init+0x0/0x39()
> > > initcall 0xc0603f55: netpoll_init+0x0/0x39() returned 0.
> > > initcall 0xc0603f55 ran
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> On Thursday 19 July 2007 12:01, Ingo Molnar wrote:
> > Calling initcall 0xc0603f55: netpoll_init+0x0/0x39()
> > initcall 0xc0603f55: netpoll_init+0x0/0x39() returned 0.
> > initcall 0xc0603f55 ran for 0 msecs: netpoll_init+0x0/0x39()
> > Calling
On Thursday 19 July 2007 12:01, Ingo Molnar wrote:
> Calling initcall 0xc0603f55: netpoll_init+0x0/0x39()
> initcall 0xc0603f55: netpoll_init+0x0/0x39() returned 0.
> initcall 0xc0603f55 ran for 0 msecs: netpoll_init+0x0/0x39()
> Calling initcall 0xc0604257: netlink_proto_init+0x0/0x12a()
>
* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> the e1000 in this laptop is historically pretty robust. The only
> problem i ever had with it were some rx/tx hw-engine latency problems
> [pings from the outside took up to 1 second to propagate] that were
> quickly fixed by the e1000 driver guys.
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> -You say that netconsole output continues to trickle after
> the network gets wedged. This could be caused by the
> e1000 watchdog, which triggers a NIC interrupt "to ensure
> rx ring is cleaned". I assume that this triggers the
>
On Thursday 19 July 2007 11:09, Ingo Molnar wrote:
> the e1000 in this laptop is historically pretty robust. The only problem
> i ever had with it were some rx/tx hw-engine latency problems [pings
> from the outside took up to 1 second to propagate] that were quickly
> fixed by the e1000 driver
i have your original patch applied to my working tree to be able to
observe this bug's behavior, and here's another observation: the problem
seems to go away if i turn on CONFIG_NO_HZ. So it looks timing related
indeed ...
but when the bug happens, it happens all the time, reboot after
On Wed, Jul 18, 2007 at 01:48:20PM +0200, Jarek Poplawski wrote:
...
> I'd be very glad if it could be verified and/or tested.
Jarek,
This patch is verified crap!
Regards,
Jarek P.
PS: Olaf,
You've written earlier that one of the main reasons for poll_napi is
to work when the kernel "doesn't
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> > also, i'm using netconsole via the command line (both the network
> > driver and netconsole is built into the bzImage), maybe that makes a
> > difference?
>
> Possibly - but so far there's nothing in the code that jumped at me.
>
> Can you try the
On Wednesday 18 July 2007 14:48, Ingo Molnar wrote:
> something i noticed: netconsole output seems to trickle through though,
> but very, very slowly (a packet once every 4 seconds or so). TCP/IP is
> not functional.
>
> also, i'm using netconsole via the command line (both the network driver
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> On Tuesday 17 July 2007 20:56, Ingo Molnar wrote:
> > i logged these not via netconsole but via logging on over the console
> > and using dmesg, so it should include everything. in the 100hz case the
> > following seems to show the anomaly:
> >
> >
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> On Tuesday 17 July 2007 20:56, Ingo Molnar wrote:
> > i logged these not via netconsole but via logging on over the console
> > and using dmesg, so it should include everything. in the 100hz case the
> > following seems to show the anomaly:
> >
> >
On Tuesday 17 July 2007 20:56, Ingo Molnar wrote:
> i logged these not via netconsole but via logging on over the console
> and using dmesg, so it should include everything. in the 100hz case the
> following seems to show the anomaly:
>
> NETDEV WATCHDOG: eth0: transmit timed out
So, it
Hi,
Here is my proposal of a solution based on dev->state flag,
but intended mainly to prevent poll_napi from disturbing
while net_rx_action is running and polling the device.
It doesn't look very nice or clean but I hope it could
guard net_rx_action enough with some room for netpoll too.
I'd
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> On Tuesday 17 July 2007 20:18, Ingo Molnar wrote:
> > (one is HZ=100, the other HZ=1000. HZ=100 produces a hung network just
> > like HZ=250.)
> >
> > no 'rx_sched set' messages in either case. Network still hung for
> > HZ=100, and is working for
On Tuesday 17 July 2007 20:18, Ingo Molnar wrote:
> (one is HZ=100, the other HZ=1000. HZ=100 produces a hung network just
> like HZ=250.)
>
> no 'rx_sched set' messages in either case. Network still hung for
> HZ=100, and is working for HZ=1000.
Is this from dmesg or the netconsole output? I
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> Hi Ingo,
>
> On Tuesday 17 July 2007 18:57, Ingo Molnar wrote:
> > i've done the patch below, but it did not change the timeouts nor did it
> > solve the 'no network' problem. netconsole output hung earlier as well.
> Hm, pity.
>
> To rule out any
* David Miller <[EMAIL PROTECTED]> wrote:
> From: Ingo Molnar <[EMAIL PROTECTED]>
> Date: Tue, 17 Jul 2007 00:37:18 +0200
>
> > I think if you leaned back and thought it through, and if you
> > applied this scenario to a bad scheduler commit from me that broke
> > your box, you'd readily
Hi Ingo,
On Tuesday 17 July 2007 18:57, Ingo Molnar wrote:
> i've done the patch below, but it did not change the timeouts nor did it
> solve the 'no network' problem. netconsole output hung earlier as well.
Hm, pity.
To rule out any e1000 problem, can you try the following please,
both
On Tue, 17 Jul 2007, Ingo Molnar wrote:
>
> i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes
> the problem go away. So it's somehow also related to jiffies.
No, I suspect it's just related to timing: you need to hit that window
when the LIST_FROZEN bit is set, and since
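
For scale, the CONFIG_HZ values tried in this thread translate into tick lengths as sketched below. This is only an illustration, assuming one jiffy lasts exactly 1/HZ seconds; `jiffy_usecs` is a made-up helper name, not a kernel function.

```c
#include <assert.h>

/* Hypothetical helper: length of one jiffy in microseconds for a given
 * CONFIG_HZ value, assuming one jiffy is exactly 1/HZ seconds. */
static unsigned long jiffy_usecs(unsigned int hz)
{
	assert(hz > 0);		/* HZ is always a positive constant */
	return 1000000UL / hz;
}
```

With the values from the reports, HZ=100 gives a 10 ms tick, HZ=250 a 4 ms tick and HZ=1000 a 1 ms tick, so any window that stays open "for about a jiffy" is up to ten times wider on the HZ=100 build than on the HZ=1000 one.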
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> Can you try what happens if you change netif_rx_complete to something
> like this:
>
>       if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)) {
>               dev->quota = dev->weight;
>               return;
>       }
>
> This is just a hack to
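
The early return suggested above can be tried out in a small user-space model. This is a sketch only, not kernel code: `struct model_dev`, the bit numbers and the `on_poll_list` flag are invented for illustration, and the non-frozen path is a simplified stand-in for "take the device off the poll list and clear RX_SCHED".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit numbers; the real __LINK_STATE_* values are not
 * taken from the kernel headers here. */
enum { MODEL_RX_SCHED = 0, MODEL_POLL_LIST_FROZEN = 1 };

/* Invented stand-in for the few net_device fields the hack touches. */
struct model_dev {
	unsigned long state;
	int quota;
	int weight;
	bool on_poll_list;
};

static bool model_test_bit(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}

/* Model of the proposed hack: while the poll list is frozen, only
 * refill the quota and leave the device scheduled. */
static void model_rx_complete(struct model_dev *dev)
{
	assert(dev != NULL);

	if (model_test_bit(MODEL_POLL_LIST_FROZEN, &dev->state)) {
		dev->quota = dev->weight;
		return;
	}

	/* Simplified normal path: leave polling mode. */
	dev->on_poll_list = false;
	dev->state &= ~(1UL << MODEL_RX_SCHED);
}
```

In this model a device that completes while the frozen bit is set keeps its RX_SCHED bit and stays on the poll list with a fresh quota; a device that completes without the bit leaves polling mode as usual.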
On Tuesday 17 July 2007 10:57, Ingo Molnar wrote:
> i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes
> the problem go away. So it's somehow also related to jiffies.
There are several "Tx Hang detected" messages in the log, which looks
a lot as if net_rx_action never runs, or
On Tue, Jul 17, 2007 at 10:57:48AM +0200, Ingo Molnar wrote:
>
> Olaf,
>
> i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes
> the problem go away. So it's somehow also related to jiffies.
IMHO it could be related to __LINK_STATE_RX_SCHED being set
too long e.g. between
On Tue, Jul 17, 2007 at 10:28:34AM +0200, Olaf Kirch wrote:
> On Tuesday 17 July 2007 09:55, Olaf Kirch wrote:
> > What I find more problematic about this portion of code though
> > is that once a net_device is over quota, net_rx_action will
> > loop for up to one jiffy, even if there's just this
Olaf,
i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes
the problem go away. So it's somehow also related to jiffies.
Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at
On Tuesday 17 July 2007 09:55, Olaf Kirch wrote:
> What I find more problematic about this portion of code though
> is that once a net_device is over quota, net_rx_action will
> loop for up to one jiffy, even if there's just this one device on
> the poll_list.
Duh, wrong. For every loop, it'll
On Monday 16 July 2007 23:40, Linus Torvalds wrote:
> - The change seems to always set the LIST_FROZEN bit when calling
>   ->poll(), and at least on e1000, the NAPI poll() routine ends up doing
>   that netif_rx_complete(), so we're *guaranteed* to always take the
>   early exit and not do
On Tuesday 17 July 2007 08:14, Jarek Poplawski wrote:
> > If after poll_napi dev->quota <= 0 dev->poll is not run and
> > __LINK_STATE_RX_SCHED bit (plus dev->poll_list) stays uncleared.
>
> Or, more precisely dev->poll_list will be cleared just after this,
> and net_rx_action returns with
On Tuesday 17 July 2007 00:08, David Miller wrote:
> Sure, but I thought it would be nice to give Olaf a day or two to
> figure out what's going on rather than have the knee-jerk reaction to
> just revert.
Oh, reverting is fine with me. I'll just resubmit the patch.
Olaf
--
Olaf Kirch | --- o
On Tue, Jul 17, 2007 at 07:46:39AM +0200, Jarek Poplawski wrote:
...
> > static void net_rx_action(struct softirq_action *h)
> > {
> >       struct softnet_data *queue = &__get_cpu_var(softnet_data);
> >       unsigned long start_time = jiffies;
> >       int budget = netdev_budget;
> >
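
The quoted function opens by recording `start_time = jiffies` and `budget = netdev_budget`. The sketch below models how such a pair can bound a polling loop. The break condition is an assumption (the snippet above is cut off before it), and `model_should_break` and `model_self_check` are illustrative names only.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed loop bound: stop once the packet budget is used up or more
 * than one tick has passed since the loop started. The subtraction is
 * unsigned, so it stays correct when the tick counter wraps around. */
static bool model_should_break(int budget, unsigned long now,
			       unsigned long start_time)
{
	return budget <= 0 || now - start_time > 1;
}

/* Small usage example for the model above. */
static void model_self_check(void)
{
	assert(!model_should_break(300, 5, 5));	/* budget left, same tick */
	assert(model_should_break(0, 5, 5));	/* budget exhausted */
	assert(model_should_break(300, 7, 5));	/* two ticks elapsed */
}
```

Under this assumption a single over-quota device cannot keep the loop running past the tick after the one it started in, however often it is re-queued; how long that is in wall-clock time depends on HZ.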
On 16-07-2007 11:12, Ingo Molnar wrote:
> current -git broke my main testbox. No TCP/IP networking to/from the box
> and e1000 would time out in xmit:
>
> NETDEV WATCHDOG: eth0: transmit timed out
> e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
...
Olaf, I think this error can
On Mon, 16 Jul 2007, Matt Mackall wrote:
>
> Unfortunately the particular patch from Olaf is presumably covering up
> another bug that other people (including Olaf) had hit. So reverting
> it is going to introduce a different regression.
It's not a regression, it's an old problem.
And the