Re: [patch] 2.4.1, 2.4.2-pre3: APIC lockups

2001-02-14 Thread Jeff Garzik



On Wed, 14 Feb 2001, Roeland Th. Jansen wrote:

> On Tue, Feb 13, 2001 at 09:13:10PM +0100, Maciej W. Rozycki wrote:
> >  Please test it extensively, as much as you can, before I submit it for
> > inclusion.  If you ever get "Aieee!!!  Remote IRR still set after unlock!" 
> > message, please report it to me immediately -- it means the code failed. 
> 
> 
> ok, so far so good.
> 
> > There is also an additional debugging/statistics counter provided in
> > /proc/cpuinfo that counts interrupts which got delivered with its trigger
> > mode mismatched.  Check it out to find if you get any misdelivered
> > interrupts at all.
> 
> currently attacking the box with a flood ping. I used a pristine 2.4.1.
> to be sure I didn't leave stuff and applied the patch.

ping -l is a good test also...

Jeff




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] 2.4.1, 2.4.2-pre3: APIC lockups

2001-02-14 Thread Roeland Th. Jansen

On Wed, Feb 14, 2001 at 05:30:57PM +, Roeland Th. Jansen wrote:
> other observations -- approx 6000 ints from the ne2k card/sec.
> MIS shows approx 1% that goes wrong with a ping flood.

oops. had to count both CPU0 and CPU1's interrupts. after 23 minutes :

           CPU0       CPU1
 19:    3824114    3823371   IO-APIC-level  eth0
MIS:      29025

makes approx 0.3%..
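
The correction above comes down to which denominator is used. A minimal sketch of the arithmetic, assuming the fused IRQ 19 columns in the archived dumps split evenly into two seven-digit per-CPU counts (the helper name `misdelivery_rate` is made up for illustration and is not part of the patch):

```python
def misdelivery_rate(mis_count, per_cpu_counts):
    """Return MIS as a fraction of all deliveries of one IRQ, summed over CPUs."""
    total = sum(per_cpu_counts)
    if total == 0:
        raise ValueError("no interrupts delivered")
    return mis_count / total


# Counts quoted in this thread for IRQ 19 (eth0) under a ping flood.
# Assumption: every MIS event belongs to IRQ 19, since MIS is a global counter.
cpu0_only = 15814 / 1758280                                  # one CPU as denominator
first_sample = misdelivery_rate(15814, [1758280, 1758266])   # both CPUs
later_sample = misdelivery_rate(29025, [3824114, 3823371])   # after 23 minutes

print(f"CPU0 only:    {cpu0_only:.2%}")
print(f"first sample: {first_sample:.2%}")
print(f"after 23 min: {later_sample:.2%}")
```

Dividing by a single CPU's count roughly doubles the apparent rate, which would explain the earlier "approx 1%" figure.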

-- 
Grobbebol's Home   |  Don't give in to spammers.   -o)
http://www.xs4all.nl/~bengel   | Use your real e-mail address   /\
Linux 2.2.16 SMP 2x466MHz / 256 MB |on Usenet. _\_v  



Re: [patch] 2.4.1, 2.4.2-pre3: APIC lockups

2001-02-14 Thread Roeland Th. Jansen

On Tue, Feb 13, 2001 at 09:13:10PM +0100, Maciej W. Rozycki wrote:
>  Please test it extensively, as much as you can, before I submit it for
> inclusion.  If you ever get "Aieee!!!  Remote IRR still set after unlock!" 
> message, please report it to me immediately -- it means the code failed. 


ok, so far so good.

> There is also an additional debugging/statistics counter provided in
> /proc/cpuinfo that counts interrupts which got delivered with its trigger
> mode mismatched.  Check it out to find if you get any misdelivered
> interrupts at all.

currently attacking the box with a flood ping. I used a pristine 2.4.1.
to be sure I didn't leave stuff and applied the patch.

observations -- system doesn't crash; usually I had to use disable focus
processor -- else it fails.

other observations -- approx 6000 ints from the ne2k card/sec.
MIS shows approx 1% that goes wrong with a ping flood.

           CPU0       CPU1
  0:      35345      36195    IO-APIC-edge  timer
  1:       1632       1534    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  3:        826        832    IO-APIC-edge  serial
  4:          4          4    IO-APIC-edge  serial
  5:      12213      12201    IO-APIC-edge  soundblaster
  8:          0          1    IO-APIC-edge  rtc
 14:       3079       2906    IO-APIC-edge  ide0
 15:          3          3    IO-APIC-edge  ide1
 18:         69         85   IO-APIC-level  BusLogic BT-930
 19:    1758280    1758266   IO-APIC-level  eth0
NMI:      71480      71480
LOC:      71459      71456
ERR:          3
MIS:      15814


good work !




-- 
Grobbebol's Home   |  Don't give in to spammers.   -o)
http://www.xs4all.nl/~bengel   | Use your real e-mail address   /\
Linux 2.2.16 SMP 2x466MHz / 256 MB |on Usenet. _\_v  



Re: [patch] 2.4.1, 2.4.2-pre3: APIC lockups

2001-02-14 Thread Maciej W. Rozycki

On Wed, 14 Feb 2001, Andrew Morton wrote:

> Tell me, please: what tradeoffs are involved in this patch?
> Obviously it works around a pretty fatal problem, but
> what are we giving away?

 The change decreases performance a bit.  For well-behaved systems the
loss is fifteen instructions: a local APIC read (uncached but supposedly
cheap), a global memory read (a cache line invalidation and fetch), seven
stack accesses (cached for sure), a taken branch and five ALU.  With the
version you have I see gcc is actually doing an extra memory read due to
the volatile APIC access presumably -- this is now fixed.

 For misdelivered interrupts the overhead is much, much bigger, involving
acquiring a spinlock and multiple (uncached and possibly slow) I/O APIC
accesses.  We may lower the overhead by undefining APIC_LOCKUP_DEBUG,
which we should do after a bit of testing.  I think we might leave
APIC_MISMATCH_DEBUG intact -- its cost is a single locked instruction
which is negligible IMO.

 Note the original version consisted of two instructions only -- a local
APIC write and "ret", sigh...

  Maciej

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--+
+e-mail: [EMAIL PROTECTED], PGP key available+




Re: [patch] 2.4.1, 2.4.2-pre3: APIC lockups

2001-02-14 Thread Andrew Morton

"Maciej W. Rozycki" wrote:
> 
> Hi,
> 
>  After performing various tests I came to the following workaround for
> APIC lockups which people observe under IRQ load, mostly for networking
> stuff.

Works fine on the dual-PII.  No "Aieee!!!" messages at all.

After sending a few gigs across the ethernet, running
irq-whacker:

mnm:/usr/src/cptimer> cat /proc/interrupts
           CPU0       CPU1
  0:      77613      61869    IO-APIC-edge  timer
  1:        253        258    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  8:          0          1    IO-APIC-edge  rtc
  9:          0          0          XT-PIC  acpi
 12:          0          0    IO-APIC-edge  PS/2 Mouse
 17:    5104855    3919759   IO-APIC-level  eth0
 18:       2334       2313   IO-APIC-level  ide2
NMI:     139418     139418
LOC:     139403     139402
ERR:        221
MIS:    5299867

And without irq-whacker:

mnm:/home/morton> cat /proc/interrupts
           CPU0       CPU1
  0:      55384      70899    IO-APIC-edge  timer
  1:          2          3    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  8:          0          1    IO-APIC-edge  rtc
  9:          0          0          XT-PIC  acpi
 12:          0          0    IO-APIC-edge  PS/2 Mouse
 17:    2554705    2554064   IO-APIC-level  eth0
 18:       1814       1812   IO-APIC-level  ide2
NMI:     126220     126220
LOC:     126202     126201
ERR:         35
MIS:          0


Tell me, please: what tradeoffs are involved in this patch?
Obviously it works around a pretty fatal problem, but
what are we giving away?

Oh: and thanks :)




Re: [patch] 2.4.1, 2.4.2-pre3: APIC lockups

2001-02-13 Thread Frank de Lange

On Tue, Feb 13, 2001 at 09:13:10PM +0100, Maciej W. Rozycki wrote:
> There is also an additional debugging/statistics counter provided in
> /proc/cpuinfo that counts interrupts which got delivered with its trigger
> mode mismatched.  Check it out to find if you get any misdelivered
> interrupts at all.

I guess you mean the MIS: counter in /proc/interrupts? This is what it says on
my box after running some 330,000 interrupts (at a rate of app. 900/second)
through the network/usb IRQ:

 cat /proc/interrupts 
           CPU0       CPU1
  0:      31693      32749    IO-APIC-edge  timer
  1:       1208       1174    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  3:        113         26    IO-APIC-edge  serial
  4:       4689       4567    IO-APIC-edge  serial
 14:       4440       4545    IO-APIC-edge  ide0
 15:       1911       2132    IO-APIC-edge  ide1
 16:      85021      84227   IO-APIC-level  es1371, mga@PCI:1:0:0
 17:         26         26   IO-APIC-level  sym53c8xx
 18:          0          0   IO-APIC-level  btaudio, bttv
 19:     165467     166254   IO-APIC-level  eth0, eth1, usb-uhci
NMI:      64376      64376
LOC:      64364      64362
ERR:          0
MIS:        647

So, that's about 650 misdelivered interrupts for 330,000 deliveries (the other
interrupts never gave me any trouble, so I guess the misdelivered ones are all
from IRQ 19), or about .2%
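
For anyone repeating this check, the same figure can be pulled out of a saved /proc/interrupts dump mechanically. This is only a rough sketch: `parse_interrupts` is a hypothetical helper, the sample text is a trimmed copy of the dump above, and attributing every MIS event to IRQ 19 is the same guess made in this message, not something the counter itself reports.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text.

    Returns ({label: [per-CPU counts]}, {"ERR"/"MIS": value}).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_cpus = len(lines[0].split())           # header row: CPU0 CPU1 ...
    per_irq, summary = {}, {}
    for line in lines[1:]:
        # Split at the first colon only; device names may contain colons.
        label, _, rest = line.partition(":")
        label = label.strip()
        fields = rest.split()
        if label in ("ERR", "MIS"):           # single global counters
            summary[label] = int(fields[0])
        else:
            per_irq[label] = [int(f) for f in fields[:n_cpus]]
    return per_irq, summary


SAMPLE = """\
           CPU0       CPU1
 16:      85021      84227   IO-APIC-level  es1371, mga@PCI:1:0:0
 19:     165467     166254   IO-APIC-level  eth0, eth1, usb-uhci
NMI:      64376      64376
LOC:      64364      64362
ERR:          0
MIS:        647
"""

per_irq, summary = parse_interrupts(SAMPLE)
total_19 = sum(per_irq["19"])                 # deliveries on both CPUs
rate = summary["MIS"] / total_19
print(f"IRQ 19: {total_19} deliveries, MIS rate {rate:.2%}")
```

Summing both CPU columns before dividing matters here, because the IO-APIC spreads the IRQ across both processors.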

When I load the network and stream some audio over it, the sound becomes a bit
choppy. The MIS: counter only increases when the network (read: IRQ 19) is
loaded; a single audio stream (app. 220 int/sec) causes no MISses to occur.

In general, I'd say the stability WITH the patch is good, and timeouts are
within tolerable levels. If I need something better, I'll probably get myself
a better set of network cards...

So, quick conclusion, this seems a reasonable fix...

Cheers//Frank

-- 
  W  ___
 ## o o\/ Frank de Lange \
 }#   \|   /  \
  ##---# _/   \
      \  +31-320-252965/
   \[EMAIL PROTECTED]/
-
 [ "Omnis enim res, quae dando non deficit, dum habetur
et non datur, nondum habetur, quomodo habenda est."  ]



Re: [patch] 2.4.1, 2.4.2-pre3: APIC lockups

2001-02-13 Thread Manfred Spraul

"Maciej W. Rozycki" wrote:
> 
> Hi,
> 
>  After performing various tests I came to the following workaround for
> APIC lockups which people observe under IRQ load, mostly for networking
> stuff.  I believe the test should work in all cases as it basically
> implements a manual replacement for EOI messages.  In my simulated
> environment I was unable to get a lockup with the code in place, even
> though I was getting about every other level-triggered IRQ misdelivered.
> 
>  Please test it extensively, as much as you can, before I submit it for
> inclusion.  If you ever get "Aieee!!!  Remote IRR still set after unlock!"
> message, please report it to me immediately -- it means the code failed.
>
No messages.

> There is also an additional debugging/statistics counter provided in
> /proc/cpuinfo that counts interrupts which got delivered with its trigger
> mode mismatched.  Check it out to find if you get any misdelivered
> interrupts at all.
> 
I'm running my default webserver load test, and I get ~40 /second, 92735
total.

bw_tcp says 1.13 MB/sec, that's wire speed.

tcpdump | grep 'sack ' doesn't show unusually many lost packets.

Looks promising.

--
Manfred


