Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 12:43:43PM +0200, Marcin Ślusarz wrote: > 2007/8/10, Jarek Poplawski <[EMAIL PROTECTED]>: > > (..) > > I think, there is this one possible for your testing yet?: > > Subject: [patch] genirq: temporary fix for level-triggered IRQ resend > > Date: Wed, 8 Aug 2007 13:00:37

Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-10 Thread Marcin Ślusarz
2007/8/10, Jarek Poplawski <[EMAIL PROTECTED]>: > (..) > I think, there is this one possible for your testing yet?: > Subject: [patch] genirq: temporary fix for level-triggered IRQ resend > Date: Wed, 8 Aug 2007 13:00:37 +0200 I think I already tested this patch, but this thread is sooo big and I

Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-10 Thread Ingo Molnar
* Jarek Poplawski <[EMAIL PROTECTED]> wrote: > All correct! There was also checked a possibility it can be not hw > itself, but wrong way of handling after hw (acking too late). This was > false idea (or bad implementation), so it looks like hw vs lapic > problem. i think the problem is that

Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 11:08:33AM +0200, Ingo Molnar wrote: > > * Jarek Poplawski <[EMAIL PROTECTED]> wrote: > > > On 10-08-2007 10:05, Thomas Gleixner wrote: > > ... > > > But suppressing the resend is not fixing the driver problem. The > > > problem can show up with spurious interrupts and

Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-10 Thread Ingo Molnar
* Jarek Poplawski <[EMAIL PROTECTED]> wrote: > On 10-08-2007 10:05, Thomas Gleixner wrote: > ... > > But suppressing the resend is not fixing the driver problem. The > > problem can show up with spurious interrupts and with interrupts on > > a shared PCI interrupt line at any time. It just
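Thomas's point is that a driver's interrupt handler can be entered when its own device has nothing to report. On a shared PCI line another device may be the one asserting it, and spurious interrupts also happen. The sketch below is a hypothetical userspace model of a shared line; all `model_*` names are invented for illustration and are not kernel APIs. Each handler checks its own device status and reports whether it did any work, in the spirit of the kernel's `IRQ_NONE` / `IRQ_HANDLED` return values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model, not kernel code: devices share one line,
 * and each handler must check its own device before doing anything. */
enum model_irq_result { MODEL_IRQ_NONE, MODEL_IRQ_HANDLED };

struct model_dev {
	bool int_pending;      /* this device is asserting the shared line */
	unsigned int serviced; /* interrupts this device actually handled */
};

static enum model_irq_result model_handler(struct model_dev *d)
{
	if (!d->int_pending)       /* not ours, or spurious: touch nothing */
		return MODEL_IRQ_NONE;
	d->int_pending = false;    /* ack at the device, so it stops asserting */
	d->serviced++;
	return MODEL_IRQ_HANDLED;
}

/* Run every handler registered on the line, as a shared-IRQ dispatcher does. */
static enum model_irq_result model_dispatch(struct model_dev *devs, size_t n)
{
	enum model_irq_result ret = MODEL_IRQ_NONE;

	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (model_handler(&devs[i]) == MODEL_IRQ_HANDLED)
			ret = MODEL_IRQ_HANDLED;
	return ret;
}
```

Once the pending device has been serviced, calling `model_dispatch()` again returns `MODEL_IRQ_NONE` and changes no state. That is the behaviour a driver needs in order to survive spurious or shared-line calls.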

Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 10:48:41AM +0200, Ingo Molnar wrote: > > * Jarek Poplawski <[EMAIL PROTECTED]> wrote: > > > On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote: > > ... > > > I was still testing on -rc2: > > > Subject: [patch] genirq: temporary fix for level-triggered

Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-10 Thread Ingo Molnar
* Jarek Poplawski <[EMAIL PROTECTED]> wrote: > On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote: > ... > > I was still testing on -rc2: > > Subject: [patch] genirq: temporary fix for level-triggered IRQ resend > > Date: Wed, 8 Aug 2007 13:00:37 +0200 > > > > For me after

Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-10 Thread Jean-Baptiste Vignaud
> For me it's enough too but Thomas seems to doubt. > > You've written earlier that you've 2.6.23-rc1 with HARDIRQS_SW_RESEND > prepared too. So, if this is not a great problem maybe you could try > this first. Tomorrow Thomas may send something, so this 100HZ could > wait yet, I hope? Ok, i'll

Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote: ... > I was still testing on -rc2: > Subject: [patch] genirq: temporary fix for level-triggered IRQ resend > Date: Wed, 8 Aug 2007 13:00:37 +0200 > > For me after 1day 20hours, the network is still up, with more than 1To > of

Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-10 Thread Jean-Baptiste Vignaud
> So, we still have to wait for the exact explanation... > > Thanks very much Marcin! > > I think, there is this one possible for your testing yet?: > Subject: [patch] genirq: temporary fix for level-triggered IRQ resend > Date: Wed, 8 Aug 2007 13:00:37 +0200 > > If it's not a great problem it

Re: [patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 08:33:27AM +0200, Marcin Ślusarz wrote: > 2007/8/9, Jarek Poplawski <[EMAIL PROTECTED]>: ... > > diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c > > --- 2.6.23-rc1-/kernel/irq/chip.c 2007-07-09 01:32:17.0 +0200 > > +++

Re: [RFC] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-09 Thread Jarek Poplawski
On Thu, Aug 09, 2007 at 06:04:34PM +0200, Andi Kleen wrote: > Jarek Poplawski <[EMAIL PROTECTED]> writes: > > > It seems, we can start to think about some preferred solutions, > > already. Here are some of my preliminary conclusions and suggestions. > > > > The problem of timeouts with some

[RFC] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-09 Thread Jarek Poplawski
It seems, we can start to think about some preferred solutions, already. Here are some of my preliminary conclusions and suggestions. The problem of timeouts with some 'older' network cards seems to hit mainly x86_64 arch, and after diagnosing and testing (still being done) it's caused by
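The patch tested in this thread, "genirq: temporary fix for level-triggered IRQ resend", is (as I understand it) about not resending level-type interrupts when a lazily disabled IRQ is re-enabled. A level line that is still asserted fires again by itself once it is unmasked, so it needs no resend. The minimal userspace sketch below models that decision. The `M_IRQ_*` flags only loosely mirror genirq's `IRQ_LEVEL`, `IRQ_PENDING` and `IRQ_REPLAY` status bits, and both their values and the function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status bits; the values are arbitrary for this model. */
#define M_IRQ_LEVEL   0x1u /* line is level-triggered */
#define M_IRQ_PENDING 0x2u /* an interrupt arrived while the IRQ was disabled */
#define M_IRQ_REPLAY  0x4u /* a resend is already in flight */

/* Called when a lazily disabled IRQ is re-enabled. Returns true if the
 * model asks for a resend (hardware retrigger or software replay). */
static bool model_check_resend(unsigned int *status)
{
	assert(status != NULL);
	/* Only a pending, non-level IRQ that is not already being replayed
	 * is resent. A still-asserted level line re-fires once unmasked. */
	if ((*status & (M_IRQ_LEVEL | M_IRQ_PENDING | M_IRQ_REPLAY)) == M_IRQ_PENDING) {
		*status = (*status & ~M_IRQ_PENDING) | M_IRQ_REPLAY;
		return true;
	}
	return false;
}
```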

[patch (testing)] Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-09 Thread Jarek Poplawski
On Wed, Aug 08, 2007 at 01:42:43PM +0200, Jarek Poplawski wrote: > Read below please: > > On Wed, Aug 08, 2007 at 01:09:36PM +0200, Marcin Ślusarz wrote: > > 2007/8/7, Jarek Poplawski <[EMAIL PROTECTED]>: > > > So, the let's try this idea yet: modified Ingo's "x86: activate > > >

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-08 Thread Jarek Poplawski
On Wed, Aug 08, 2007 at 10:59:22AM +0200, Jean-Baptiste Vignaud wrote: ... > > If you would like to read something more about testing (then of > > course my suggestions could occur invalid - I'm a very bad tester > > myself...) you can try this: > > http://www.stardust.webpages.pl/files/handbook/

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-08 Thread Jarek Poplawski
On Wed, Aug 08, 2007 at 01:42:43PM +0200, Jarek Poplawski wrote: ... > So, it looks like x86_64 io_apic's IPI code was unused too long... To be fair it's x86_64 lapic's IPI code. Jarek P.

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-08 Thread Jarek Poplawski
Read below please: On Wed, Aug 08, 2007 at 01:09:36PM +0200, Marcin Ślusarz wrote: > 2007/8/7, Jarek Poplawski <[EMAIL PROTECTED]>: > > So, the let's try this idea yet: modified Ingo's "x86: activate > > HARDIRQS_SW_RESEND" patch. > > (Don't forget about make oldconfig before make.) > > For

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-08 Thread Marcin Ślusarz
2007/8/7, Jarek Poplawski <[EMAIL PROTECTED]>: > And here is one more patch to test the same idea (chip->retrigger()). > Let's try i386 way! (I hope I will not be arrested for this...) > (Should be tested without any previous patches.) > > Jarek P. > > PS: as above > > --- > > diff -Nurp

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-08 Thread Marcin Ślusarz
2007/8/7, Jarek Poplawski <[EMAIL PROTECTED]>: > So, the let's try this idea yet: modified Ingo's "x86: activate > HARDIRQS_SW_RESEND" patch. > (Don't forget about make oldconfig before make.) > For testing only. > > Cheers, > Jarek P. > > PS: alas there was not even time for "compile checking"...
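The patches quoted in these messages compare two resend paths. Roughly, genirq first asks the chip to retrigger the interrupt through `chip->retrigger()`, which on x86_64 goes through the local APIC's IPI code. Only if that is unavailable or fails does `CONFIG_HARDIRQS_SW_RESEND` queue the IRQ for a software replay from a tasklet. The sketch below models that fallback order in userspace. The structure, the boolean array standing in for the resend bitmap, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two resend paths discussed in the thread. */
struct model_chip {
	/* Returns true if the hardware (APIC) path accepted the retrigger.
	 * NULL means the chip cannot retrigger at all. */
	bool (*retrigger)(unsigned int irq);
};

#define MODEL_NR_IRQS 32u
static bool sw_resend_queue[MODEL_NR_IRQS]; /* stands in for a resend bitmap */

static bool hw_retrigger_ok(unsigned int irq) { (void)irq; return true; }

/* Returns true if a resend was arranged by either path. */
static bool model_resend(const struct model_chip *chip, unsigned int irq,
			 bool sw_resend_enabled)
{
	assert(irq < MODEL_NR_IRQS);
	if (chip->retrigger && chip->retrigger(irq))
		return true;                 /* hardware path, e.g. a self-IPI */
	if (sw_resend_enabled) {
		sw_resend_queue[irq] = true; /* replayed later from deferred context */
		return true;
	}
	return false;                        /* interrupt is effectively lost */
}
```

In this model, switching to the software path only means making `retrigger` unavailable, so the replay no longer depends on the APIC IPI code at all.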

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-08 Thread Jarek Poplawski
On Wed, Aug 08, 2007 at 10:59:22AM +0200, Jean-Baptiste Vignaud wrote: > > Jean-Baptiste: I'm not sure how much of this testing you can afford? > > If you can spare some time for this and your box isn't for > > 'production' it could be very precious to diagnose such reproducible > > bug. > > Well

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-08 Thread Jean-Baptiste Vignaud
> Jean-Baptiste: I'm not sure how much of this testing you can afford? > If you can spare some time for this and your box isn't for > 'production' it could be very precious to diagnose such reproducible > bug. Well i can continue testing patches for sure. > Then, I'd have a few suggestions (you

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-08 Thread Jarek Poplawski
On Wed, Aug 08, 2007 at 09:21:14AM +0200, Jarek Poplawski wrote: > On Tue, Aug 07, 2007 at 07:16:33PM +0200, Jean-Baptiste Vignaud wrote: ... > Marcin has done this with successfully using the most professional > way: git bisect (which btw. I did learn yet), but, IMHO, it could be ... Let me say

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-08 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 07:16:33PM +0200, Jean-Baptiste Vignaud wrote: ... > So this afternoon i compiled 2.6.23-rc2 with same options as 2.6.23-rc1 > and edited grub.conf to add nosmp but after reboot the box did not > responded. Back home, i saw that the kernel failed because it was unable > to

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-07 Thread Jean-Baptiste Vignaud
> On Tue, Aug 07, 2007 at 11:21:07AM +0200, Jean-Baptiste Vignaud wrote: > > > > > > * interrupts (i use irqbalance, but problem was the same without) > > > > > > I wonder if you tried without SMP too? > > > > No i did not. Do you think that this can be a problem ? > > To test with no SMP, do i

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 02:13:39PM +0200, Jarek Poplawski wrote: > On Tue, Aug 07, 2007 at 11:52:46AM +0200, Jarek Poplawski wrote: > > On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote: ... > > > No, i don't need a break. I'll have more time in next weeks. > > > > Great! So, I'll

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 11:52:46AM +0200, Jarek Poplawski wrote: > On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote: > > 2007/8/7, Jarek Poplawski <[EMAIL PROTECTED]>: > > > On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote: > > > > Network card still locks up (tested on

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Mon, Aug 06, 2007 at 01:43:48PM -0400, Chuck Ebbert wrote: > On 08/06/2007 03:03 AM, Ingo Molnar wrote: > > > > But, since level types don't need this retriggers too much I think > > this "don't mask interrupts by default" idea should be rethinked: > > is there enough gain to risk such hard to

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote: > 2007/8/7, Jarek Poplawski <[EMAIL PROTECTED]>: > > On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote: > > > Network card still locks up (tested on 2.6.22.1). I had to upload more > > > data than usual (~350 MB vs ~1-100

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 11:21:07AM +0200, Jean-Baptiste Vignaud wrote: > > > > * interrupts (i use irqbalance, but problem was the same without) > > > > I wonder if you tried without SMP too? > > No i did not. Do you think that this can be a problem ? > To test with no SMP, do i need to

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-07 Thread Jean-Baptiste Vignaud
> > * interrupts (i use irqbalance, but problem was the same without) > > I wonder if you tried without SMP too? No i did not. Do you think that this can be a problem ? To test with no SMP, do i need to recompile kernel or is there a kernel parameter ? > BTW, Jean-Baptiste and Chuck -

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 10:10:34AM +0200, Jean-Baptiste Vignaud wrote: > > > BTW: Jean-Babtiste, could you send or point to you current configs? Oops! I'm very sorry for misspelling! > > I mean at least proc/interrupts, but with dmesg and .config it would > > be even better. (I assume this last

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote: > 2007/8/6, Ingo Molnar <[EMAIL PROTECTED]>: > > (..) > > please try Jarek's second patch too - there was a missing unmask. > > > > Ingo > > > > --> > > Subject: genirq: fix simple and fasteoi irq handlers > >

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-07 Thread Jean-Baptiste Vignaud
> BTW: Jean-Babtiste, could you send or point to you current configs? > I mean at least proc/interrupts, but with dmesg and .config it would > be even better. (I assume this last report was about the revert patch > mentioned by Chuck, not the one below your message?) Sure. Last reports are

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-07 Thread Marcin Ślusarz
2007/8/6, Ingo Molnar <[EMAIL PROTECTED]>: > (..) > please try Jarek's second patch too - there was a missing unmask. > > Ingo > > --> > Subject: genirq: fix simple and fasteoi irq handlers > From: Jarek Poplawski <[EMAIL PROTECTED]> > > After the "genirq: do not mask

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Mon, Aug 06, 2007 at 05:19:03PM -0400, Chuck Ebbert wrote: > On 08/06/2007 04:42 PM, Jean-Baptiste Vignaud wrote: > > Mmm, bad news, after 4 hours of intensive network stressing, one of the 2 > > 3com card failed with the latest fedora kernel. > > > > Aug 6 22:31:09 loki kernel: NETDEV

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Al Boldi
Jean-Baptiste Vignaud wrote: > Mmm, bad news, after 4 hours of intensive network stressing, one of the 2 > 3com card failed with the latest fedora kernel. > > Aug 6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out > Aug 6 22:31:09 loki kernel: eth2: transmit timed out, tx_status

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Chuck Ebbert
On 08/06/2007 04:42 PM, Jean-Baptiste Vignaud wrote: > Mmm, bad news, after 4 hours of intensive network stressing, one of the 2 > 3com card failed with the latest fedora kernel. > > Aug 6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out > Aug 6 22:31:09 loki kernel: eth2:

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Jean-Baptiste Vignaud
Mmm, bad news, after 4 hours of intensive network stressing, one of the 2 3com card failed with the latest fedora kernel. Aug 6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out Aug 6 22:31:09 loki kernel: eth2: transmit timed out, tx_status 00 status e601. Aug 6 22:31:09 loki
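The `NETDEV WATCHDOG: eth2: transmit timed out` line comes from the networking core's transmit watchdog, not from the 3com driver itself. Roughly, the watchdog fires when the device's queue has been stopped and the last transmit started more than `watchdog_timeo` jiffies ago. The comparison uses the wraparound-safe `time_after()` idiom. The sketch below models only that check, and its function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Wraparound-safe "a is after b" on an unsigned tick counter, the same
 * idea as the kernel's time_after() macro. */
static bool model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Hypothetical model of the watchdog condition: the queue is stopped and
 * the last transmit started more than watchdog_timeo ticks ago. */
static bool model_tx_timed_out(bool queue_stopped, unsigned long now,
			       unsigned long trans_start,
			       unsigned long watchdog_timeo)
{
	assert(watchdog_timeo > 0);
	return queue_stopped &&
	       model_time_after(now, trans_start + watchdog_timeo);
}
```

Because of the signed subtraction, the check still works when the tick counter wraps between `trans_start` and `now`.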

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Jean-Baptiste Vignaud
> * Chuck Ebbert <[EMAIL PROTECTED]> wrote: > > > Before, they would print: > > > > eth0: transmit timed out, tx_status 00 status e601. > > diagnostics: net 0ccc media 8880 dma 003a fifo > > eth0: Interrupt posted but not delivered -- IRQ blocked by another device? > > Flags;

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Ingo Molnar
* Chuck Ebbert <[EMAIL PROTECTED]> wrote: > Before, they would print: > > eth0: transmit timed out, tx_status 00 status e601. > diagnostics: net 0ccc media 8880 dma 003a fifo > eth0: Interrupt posted but not delivered -- IRQ blocked by another device? > Flags; bus-master 1, dirty
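The `status e601` value in these reports can be decoded. The bit names below are an assumption, recalled from the 3c59x driver's status-register definitions, and should be checked against `drivers/net/3c59x.c`. If they are right, `0xe601` shows three things:

- The interrupt-latch bit is set. The card raised an interrupt that was never serviced, which is what the "Interrupt posted but not delivered" message reports.
- Both bus-master completion bits are set.
- Register window 7 is selected in the top three bits.

```c
#include <assert.h>
#include <stdbool.h>

/* ASSUMPTION: bit values recalled from the 3c59x status-register enum;
 * verify against drivers/net/3c59x.c before relying on them. */
#define M_INT_LATCH     0x0001u /* card has latched an interrupt request */
#define M_DOWN_COMPLETE 0x0200u /* assumed: bus-master download (TX) done */
#define M_UP_COMPLETE   0x0400u /* assumed: bus-master upload (RX) done */

/* The top three bits of the 16-bit status word hold the selected window. */
static unsigned int model_status_window(unsigned int status)
{
	assert(status <= 0xffffu);
	return (status >> 13) & 0x7u;
}

/* True when the card claims an interrupt that the CPU never serviced. */
static bool model_irq_posted_not_delivered(unsigned int status)
{
	return (status & M_INT_LATCH) != 0;
}
```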

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Chuck Ebbert
On 08/06/2007 03:03 AM, Ingo Molnar wrote: > > But, since level types don't need this retriggers too much I think > this "don't mask interrupts by default" idea should be rethinked: > is there enough gain to risk such hard to diagnose errors? > > I reverted those masking changes in Fedora
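The change being questioned here, "genirq: do not mask interrupts by default", made `disable_irq()` lazy. As I understand it, the IRQ is only flagged as disabled. It is masked at the controller only if an interrupt actually arrives in the meantime, and at that point it is also marked pending for a later resend. The hypothetical userspace model below contrasts that with eager masking; the field and function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the two disable_irq() strategies. */
struct model_irq {
	bool disabled; /* software "disabled" flag */
	bool masked;   /* line masked at the interrupt controller */
	bool pending;  /* an interrupt arrived while disabled */
	bool lazy;     /* true: "do not mask by default" behaviour */
};

static void model_disable(struct model_irq *d)
{
	d->disabled = true;
	if (!d->lazy)
		d->masked = true;  /* eager: mask at the controller right away */
}

/* The controller delivers an interrupt. Returns true if the driver
 * handler would run. */
static bool model_deliver(struct model_irq *d)
{
	assert(!d->masked);        /* a masked line is never delivered */
	if (d->disabled) {
		d->pending = true; /* lazy path: remember the interrupt... */
		d->masked = true;  /* ...and only now mask the line */
		return false;
	}
	return true;
}

static void model_enable(struct model_irq *d)
{
	d->disabled = false;
	d->masked = false;
	/* d->pending is left for the resend logic to deal with */
}
```

With `lazy` set, an interrupt that arrives while the IRQ is disabled leaves `pending` set after re-enable. The resend path, and on x86_64 its APIC retrigger, must then handle that leftover state correctly.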

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Ingo Molnar
* Marcin Ślusarz <[EMAIL PROTECTED]> wrote: > 2007/7/31, Jarek Poplawski <[EMAIL PROTECTED]>: > > Marcin, > > > > I see you're quite busy, but if after testing this next Ingo's patch > > you are alive yet, maybe you could try one more "idea"? No patch this > > time, but if you could try this

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Marcin Ślusarz
2007/7/31, Jarek Poplawski <[EMAIL PROTECTED]>: > Marcin, > > I see you're quite busy, but if after testing this next Ingo's patch > you are alive yet, maybe you could try one more "idea"? No patch this > time, but if you could try this after adding boot option "noirqdebug" > (I'd like to be sure
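`noirqdebug` turns off the kernel's spurious-interrupt detector, which can disable an IRQ line it believes nobody is handling ("nobody cared"). Ruling it out makes sense here, because a silently disabled line would look like a dead card. The sketch below models the detector's sampling window. The thresholds are what I recall from `kernel/irq/spurious.c` of this era and should be treated as an assumption: the line is disabled if more than 99,900 of 100,000 interrupts went unhandled. The names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the spurious-IRQ detector's counters.
 * ASSUMPTION: 100000-interrupt window, disable above 99900 unhandled. */
struct model_spurious {
	unsigned int count;     /* interrupts seen in the current window */
	unsigned int unhandled; /* of those, how many no handler claimed */
};

/* Returns true if the window just completed says the line should be disabled. */
static bool model_note_interrupt(struct model_spurious *s, bool handled)
{
	s->count++;
	if (!handled)
		s->unhandled++;
	assert(s->unhandled <= s->count); /* at most one miss per interrupt */
	if (s->count < 100000u)
		return false;
	bool disable = s->unhandled > 99900u;
	s->count = 0;      /* start a new window */
	s->unhandled = 0;
	return disable;
}
```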

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Marcin Ślusarz
2007/8/1, Ingo Molnar <[EMAIL PROTECTED]>: > ok, it wasnt supposed to be _that_ easy i guess :-) Can you please > (re-)confirm that the workaround below indeed fixes the hung card > problem? (after producing a single WARN_ON message into the syslog) yes, with this patch everything works fine end

Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-06 Thread Marcin Ślusarz
2007/8/1, Ingo Molnar [EMAIL PROTECTED]: ok, it wasnt supposed to be _that_ easy i guess :-) Can you please (re-)confirm that the workaround below indeed fixes the hung card problem? (after producing a single WARN_ON message into the syslog) yes, with this patch everything works fine end of

Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-06 Thread Marcin Ślusarz
2007/7/31, Jarek Poplawski [EMAIL PROTECTED]: Marcin, I see you're quite busy, but if after testing this next Ingo's patch you are alive yet, maybe you could try one more idea? No patch this time, but if you could try this after adding boot option noirqdebug (I'd like to be sure it's not

Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-06 Thread Ingo Molnar
* Marcin Ślusarz [EMAIL PROTECTED] wrote: 2007/7/31, Jarek Poplawski [EMAIL PROTECTED]: Marcin, I see you're quite busy, but if after testing this next Ingo's patch you are alive yet, maybe you could try one more idea? No patch this time, but if you could try this after adding boot

Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-06 Thread Chuck Ebbert
On 08/06/2007 03:03 AM, Ingo Molnar wrote: But, since level types don't need this retriggers too much I think this don't mask interrupts by default idea should be rethinked: is there enough gain to risk such hard to diagnose errors? I reverted those masking changes in Fedora and the

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Ingo Molnar
* Chuck Ebbert <[EMAIL PROTECTED]> wrote: Before, they would print: eth0: transmit timed out, tx_status 00 status e601. diagnostics: net 0ccc media 8880 dma 003a fifo eth0: Interrupt posted but not delivered -- IRQ blocked by another device? Flags; bus-master 1, dirty

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Jean-Baptiste Vignaud
* Chuck Ebbert <[EMAIL PROTECTED]> wrote: Before, they would print: eth0: transmit timed out, tx_status 00 status e601. diagnostics: net 0ccc media 8880 dma 003a fifo eth0: Interrupt posted but not delivered -- IRQ blocked by another device? Flags; bus-master 1, dirty

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Jean-Baptiste Vignaud
Mmm, bad news, after 4 hours of intensive network stressing, one of the 2 3com card failed with the latest fedora kernel. Aug 6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out Aug 6 22:31:09 loki kernel: eth2: transmit timed out, tx_status 00 status e601. Aug 6 22:31:09 loki

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Chuck Ebbert
On 08/06/2007 04:42 PM, Jean-Baptiste Vignaud wrote: Mmm, bad news, after 4 hours of intensive network stressing, one of the 2 3com card failed with the latest fedora kernel. Aug 6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out Aug 6 22:31:09 loki kernel: eth2: transmit

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-06 Thread Al Boldi
Jean-Baptiste Vignaud wrote: Mmm, bad news, after 4 hours of intensive network stressing, one of the 2 3com card failed with the latest fedora kernel. Aug 6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out Aug 6 22:31:09 loki kernel: eth2: transmit timed out, tx_status 00

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-01 Thread Ingo Molnar
* Marcin Ślusarz <[EMAIL PROTECTED]> wrote: > > ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR); > > + /* force POST: */ > > + ei_inb_p(e8390_base + EN0_IMR); > > > > spin_unlock(&ei_local->page_lock); > > enable_irq_lockdep_irqrestore(dev->irq, &flags); > > > > Bad news.

Re: 2.6.20->2.6.21 - networking dies after random time

2007-08-01 Thread Marcin Ślusarz
2007/7/30, Ingo Molnar <[EMAIL PROTECTED]>: > (..) > does the patch below fix those timeouts? It tests the theory whether any > POST latency could expose this problem. > > Ingo > > Index: linux/drivers/net/lib8390.c > === >

Re: 2.6.20->2.6.21 - networking dies after random time

2007-07-31 Thread Jarek Poplawski
On Mon, Jul 30, 2007 at 09:29:38AM +0200, Marcin Ślusarz wrote: ... > ps: I retested all patches posted in this thread on top of 2.6.22.1 > and behavior from 2.6.21.3 didn't changed. My next tests will be on > 2.6.22.x only. Marcin, I see you're quite busy, but if after testing this next Ingo's

Re: [PATCH][netdrvr] lib8390: comment on locking by Alan Cox Re: 2.6.20->2.6.21 - networking dies after random time

2007-07-30 Thread Jeff Garzik
Jarek Poplawski wrote: Hi, Very below is my patch proposal with a comment, which in my opinion is precious enough to save it for future help in reading and understanding the code. I hope Alan will not blame me I've not asked for his permission before sending, and he would ack this patch as it

Re: 2.6.20->2.6.21 - networking dies after random time

2007-07-30 Thread Alan Cox
> So the whole locking is to be able to keep irqs enabled for a long time, > without risking entry of the same IRQ handler on this same CPU, correct? As implemented - on any CPU. We also need to know that the IRQ handler is not doing useful work on another processor which is why we take the

Re: 2.6.20->2.6.21 - networking dies after random time

2007-07-30 Thread Ingo Molnar
* Marcin Ślusarz <[EMAIL PROTECTED]> wrote: > > Subject: x86: activate HARDIRQS_SW_RESEND > > From: Ingo Molnar <[EMAIL PROTECTED]> > > > > activate the software-triggered IRQ-resend logic. > This patch didn't help (tested on 2.6.22.1) - ne2k_pci timed out. ok. This makes it more likely that

Re: 2.6.20->2.6.21 - networking dies after random time

2007-07-30 Thread Ingo Molnar
* Alan Cox <[EMAIL PROTECTED]> wrote: > Ok the logic behind the 8390 is very simple: thanks for the explanation Alan! A few comments and a question: > Things to know > - IRQ delivery is asynchronous to the PCI bus > - Blocking the local CPU IRQ via spin locks was too slow > -

Re: 2.6.20->2.6.21 - networking dies after random time

2007-07-30 Thread Marcin Ślusarz
2007/7/26, Ingo Molnar <[EMAIL PROTECTED]>: > (..) > yeah - i meant to cover both arches but forgot about x86_64 - updated > patch attached below. > > Ingo > > -> > Subject: x86: activate HARDIRQS_SW_RESEND > From: Ingo Molnar <[EMAIL PROTECTED]> > > activate the
