Re: Possibly SATA related freeze killed networking and RAID

2007-12-10 Thread noah
2007/11/21, noah [EMAIL PROTECTED]:
 2007/11/21, Alan Cox [EMAIL PROTECTED]:
   I've had other freezes before but this was the first time I was able
   to see what was actually going on.
   IRQ 21 appears to be shared between sata_nv and ethernet.
  
   Does this mean my hardware/BIOS is broken somehow?
 
  Not necessarily. It could be a bug in one of the drivers using IRQ 21
  (sata_nv or the nvidia ethernet), it could be another inactive device, or
  it could be a hardware funny.

 How can I tell if there's an inactive device?

  Nvidia stuff can be quite hard to diagnose as we have no documentation
  but we can try. The first question is whether it is network or disk
  triggered - seeing if heavy loads to one or the other trigger the problem
  might be a first plan.

 I haven't managed to trigger it again yet but at the time the CPU was
 heavily loaded and I was re-indexing a database which caused a lot of
 disk activity. I'm quite confident the network was pretty much idle at
 the time.

The same thing has happened twice now, both during the weekly check of
the md0 and md1 RAID1-arrays. That is, networking on the primary
interface is dead. Its interrupt (irq 21) is shared between sata_nv
and forcedeth.

Is there anything I can do to debug this problem?

I don't have access to the logs right now but will have later.

  -- noah
-
To unsubscribe from this list: send the line unsubscribe linux-ide in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Possibly SATA related freeze killed networking and RAID

2007-12-03 Thread Phillip Susi

Tejun Heo wrote:
 Surprise, surprise.  There's no way to tell whether the controller
 raised interrupt or not if command is not in progress.  As I said
 before, there's no IRQ pending bit.  While processing commands, you can
 tell by looking at other status registers but when there's nothing in
 flight and the controller determines it's a good time to raise a
 spurious interrupt, there's no way you can tell.  That dang SFF
 interface is like 15+ years old.

 But we can still make things pretty robust.  We're working on it.

 Thanks.

It sounds like you mean that you know the controller did NOT raise the 
interrupt ( intentionally/correctly ) if there was no command in 
progress, as opposed to not being able to tell.  Unless there is some 
condition under which it is valid for the controller to raise an 
interrupt when it had no commands in progress?  And if that's the case 
and there's no way to know WHY, that's a broken design.




Re: Possibly SATA related freeze killed networking and RAID

2007-12-03 Thread Tejun Heo
Phillip Susi wrote:
 Tejun Heo wrote:
 Surprise, surprise.  There's no way to tell whether the controller
 raised interrupt or not if command is not in progress.  As I said
 before, there's no IRQ pending bit.  While processing commands, you can
 tell by looking at other status registers but when there's nothing in
 flight and the controller determines it's a good time to raise a
 spurious interrupt, there's no way you can tell.  That dang SFF
 interface is like 15+ years old.

 But we can still make things pretty robust.  We're working on it.
 
 It sounds like you mean that you know the controller did NOT raise the
 interrupt ( intentionally/correctly ) if there was no command in
 progress, as opposed to not being able to tell.  Unless there is some
 condition under which it is valid for the controller to raise an
 interrupt when it had no commands in progress?  And if that's the case
 and there's no way to know WHY, that's a broken design.

If everything works correctly, all interrupts can be accounted for.
It's just that there's no margin for erratic behaviors and most ATA
controllers are built really cheap.  So, yeah, it's a 15+ years old
half-broken design.

-- 
tejun


Re: Possibly SATA related freeze killed networking and RAID

2007-11-30 Thread Pavel Machek
On Fri 2007-11-30 10:00:55, Mark Lord wrote:
 Pavel Machek wrote:
 On Fri 2007-11-30 13:13:44, Alan Cox wrote:
 Why does a single spurious interrupt cause it to be shut down?  I can 
 It doesn't.

 see if the interrupt is stuck on and keeps interrupting constantly, but 
 if it's just the occasional spurious interrupt, why not just ignore it 
 and move on?
 The interrupt is usually level triggered so it continues to create
 interrupts until you silence it. The thresholds are about 10,000
 interrupt events and on newer kernels we also reset the count if we don't
 see any for a while. That works for most stuff except the thinkpad
 bluetooth problem.
 Which is confirmed hw problem now, btw.
 ...

 What problem is that, exactly?

Spurious interrupt, interrupt link is disabled after ~15 minutes. It
seems pretty unique to t61.

 My Dell has an internal USB BT adapter that briefly appears
 and then disappears again on resume (or stays if I have enabled it
 via the BIOS key).

 I wonder if that has anything to do with the (new in) 2.6.23 pauses
 that machine has on resume (about every 10th time).

No idea, but t61 problem seems different.
Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: Possibly SATA related freeze killed networking and RAID

2007-11-30 Thread Tejun Heo
Phillip Susi wrote:
 Tejun Heo wrote:
 Because SFF ATA controllers don't have an IRQ pending bit.  You don't know
 whether IRQ is raised or not.  Plus, accessing the status register which
 clears pending IRQ can be very slow on PATA machines.  It has to go
 through the PCI and ATA bus and come back.  So, unconditionally trying
 to clear IRQ by accessing Status can incur noticeable overhead if the
 IRQ is shared with devices which raise a lot of IRQs.
 
 There HAS to be a way to determine if that device generated the
 interrupt, or the interrupt can not be shared.  Since the kernel said
 nobody cared about the interrupt, that indicates that the sata driver
 checked the status register and realized the sata chip didn't generate
 the interrupt, and returned to the kernel letting it know that the
 interrupt was not for it.

Surprise, surprise.  There's no way to tell whether the controller
raised interrupt or not if command is not in progress.  As I said
before, there's no IRQ pending bit.  While processing commands, you can
tell by looking at other status registers but when there's nothing in
flight and the controller determines it's a good time to raise a
spurious interrupt, there's no way you can tell.  That dang SFF
interface is like 15+ years old.

But we can still make things pretty robust.  We're working on it.

Thanks.

-- 
tejun


Re: Possibly SATA related freeze killed networking and RAID

2007-11-30 Thread Mark Lord

Pavel Machek wrote:
 On Fri 2007-11-30 13:13:44, Alan Cox wrote:
   Why does a single spurious interrupt cause it to be shut down?  I can

  It doesn't.

   see if the interrupt is stuck on and keeps interrupting constantly, but
   if it's just the occasional spurious interrupt, why not just ignore it
   and move on?

  The interrupt is usually level triggered so it continues to create
  interrupts until you silence it. The thresholds are about 10,000
  interrupt events and on newer kernels we also reset the count if we don't
  see any for a while. That works for most stuff except the thinkpad
  bluetooth problem.

 Which is confirmed hw problem now, btw.

...

What problem is that, exactly?

My Dell has an internal USB BT adapter that briefly appears
and then disappears again on resume (or stays if I have enabled it
via the BIOS key).

I wonder if that has anything to do with the (new in) 2.6.23 pauses
that machine has on resume (about every 10th time).

Cheers


Re: Possibly SATA related freeze killed networking and RAID

2007-11-30 Thread Pavel Machek
On Fri 2007-11-30 13:13:44, Alan Cox wrote:
  Why does a single spurious interrupt cause it to be shut down?  I can 
 
 It doesn't.
 
  see if the interrupt is stuck on and keeps interrupting constantly, but 
  if it's just the occasional spurious interrupt, why not just ignore it 
  and move on?
 
 The interrupt is usually level triggered so it continues to create
 interrupts until you silence it. The thresholds are about 10,000
 interrupt events and on newer kernels we also reset the count if we don't
 see any for a while. That works for most stuff except the thinkpad
 bluetooth problem.

Which is confirmed hw problem now, btw.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: Possibly SATA related freeze killed networking and RAID

2007-11-30 Thread Alan Cox
 Why does a single spurious interrupt cause it to be shut down?  I can 

It doesn't.

 see if the interrupt is stuck on and keeps interrupting constantly, but 
 if it's just the occasional spurious interrupt, why not just ignore it 
 and move on?

The interrupt is usually level triggered so it continues to create
interrupts until you silence it. The thresholds are about 10,000
interrupt events and on newer kernels we also reset the count if we don't
see any for a while. That works for most stuff except the thinkpad
bluetooth problem.
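
The counting scheme described above can be sketched roughly as follows. This is a toy model in Python, not the kernel's code: the class name `IrqLine`, the method `note`, and the `window`, `threshold` and `quiet_gap` parameters are all made up for illustration, and the numbers used in the two runs are placeholders rather than the kernel's actual constants.

```python
from dataclasses import dataclass


@dataclass
class IrqLine:
    """Toy model of spurious-interrupt accounting on one IRQ line."""
    window: int       # interrupts seen between checks (assumed value)
    threshold: int    # unhandled count per window that disables the line
    quiet_gap: float  # seconds of quiet after which the count restarts
    count: int = 0
    unhandled: int = 0
    last_unhandled: float = float("-inf")
    disabled: bool = False

    def note(self, handled: bool, now: float) -> None:
        """Record one interrupt and whether any handler claimed it."""
        if self.disabled:
            return
        if not handled:
            # Restart the count if the last unhandled IRQ was long ago.
            if now - self.last_unhandled > self.quiet_gap:
                self.unhandled = 1
            else:
                self.unhandled += 1
            self.last_unhandled = now
        self.count += 1
        if self.count < self.window:
            return
        # End of window: disable only if nearly every IRQ went unhandled.
        self.count = 0
        if self.unhandled > self.threshold:
            self.disabled = True
        self.unhandled = 0


# A stuck level-triggered line: every interrupt goes unhandled.
stuck = IrqLine(window=10_000, threshold=9_990, quiet_gap=0.1)
for i in range(10_000):
    stuck.note(handled=False, now=i * 1e-5)

# An occasional spurious interrupt among normally handled ones.
occasional = IrqLine(window=10_000, threshold=9_990, quiet_gap=0.1)
for i in range(10_000):
    occasional.note(handled=(i % 1000 != 0), now=i * 1e-3)

print(stuck.disabled, occasional.disabled)
```

In this model the stuck line ends up disabled while the line with only occasional strays keeps running, which is the distinction being drawn in the reply.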



Re: Possibly SATA related freeze killed networking and RAID

2007-11-30 Thread Phillip Susi

Tejun Heo wrote:
 Because SFF ATA controllers don't have an IRQ pending bit.  You don't know
 whether IRQ is raised or not.  Plus, accessing the status register which
 clears pending IRQ can be very slow on PATA machines.  It has to go
 through the PCI and ATA bus and come back.  So, unconditionally trying
 to clear IRQ by accessing Status can incur noticeable overhead if the
 IRQ is shared with devices which raise a lot of IRQs.

There HAS to be a way to determine if that device generated the 
interrupt, or the interrupt can not be shared.  Since the kernel said 
nobody cared about the interrupt, that indicates that the sata driver 
checked the status register and realized the sata chip didn't generate 
the interrupt, and returned to the kernel letting it know that the 
interrupt was not for it.




Re: Possibly SATA related freeze killed networking and RAID

2007-11-29 Thread Phillip Susi

Tejun Heo wrote:
 Agreed.  Nobody cared on ATA controllers is usually very effective at
 taking the whole machine down.  Is there any reason why we don't turn on
 irqpoll on turned off IRQs automatically?

Why does a single spurious interrupt cause it to be shut down?  I can 
see if the interrupt is stuck on and keeps interrupting constantly, but 
if it's just the occasional spurious interrupt, why not just ignore it 
and move on?




Re: Possibly SATA related freeze killed networking and RAID

2007-11-29 Thread Tejun Heo
Phillip Susi wrote:
 Tejun Heo wrote:
 Agreed.  Nobody cared on ATA controllers is usually very effective at
 taking the whole machine down.  Is there any reason why we don't turn on
 irqpoll on turned off IRQs automatically?
 
 Why does a single spurious interrupt cause it to be shut down?  I can
 see if the interrupt is stuck on and keeps interrupting constantly, but
 if it's just the occasional spurious interrupt, why not just ignore it
 and move on?

Because SFF ATA controllers don't have an IRQ pending bit.  You don't know
whether IRQ is raised or not.  Plus, accessing the status register which
clears pending IRQ can be very slow on PATA machines.  It has to go
through the PCI and ATA bus and come back.  So, unconditionally trying
to clear IRQ by accessing Status can incur noticeable overhead if the
IRQ is shared with devices which raise a lot of IRQs.
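
A rough way to picture the trade-off is the toy model below. It is only a sketch in Python under stated assumptions: the classes `Nic` and `SffController`, the `dispatch` helper and the `status_reads` counter are hypothetical names invented for illustration and do not correspond to libata or forcedeth code.

```python
class Nic:
    """Hypothetical device that exposes an 'interrupt pending' bit."""

    def __init__(self):
        self.irq_pending = False

    def handler(self):
        # A cheap register read tells the driver whether the IRQ is its own.
        if not self.irq_pending:
            return False
        self.irq_pending = False
        return True


class SffController:
    """Hypothetical SFF-style ATA controller: no pending bit to consult."""

    def __init__(self):
        self.command_in_flight = False
        self.line_asserted = False  # hardware state, invisible to the driver
        self.status_reads = 0       # counts slow Status register accesses

    def read_status(self):
        # Reading Status clears a pending IRQ but costs a slow bus round trip.
        self.status_reads += 1
        self.line_asserted = False

    def handler(self):
        # With nothing in flight the driver has no evidence the IRQ is its
        # own, so it declines and the line stays asserted.
        if not self.command_in_flight:
            return False
        self.read_status()
        self.command_in_flight = False
        return True

    def handler_always_clear(self):
        # Alternative: always read Status. This silences a stray IRQ, but
        # the slow access is paid on every interrupt of the shared line.
        self.read_status()
        return False


def dispatch(handlers):
    """Call every handler on the shared line; True if any claimed the IRQ."""
    results = [h() for h in handlers]
    return any(results)


nic, ata = Nic(), SffController()

# Stray interrupt from the ATA controller while it is idle.
ata.line_asserted = True
claimed = dispatch([nic.handler, ata.handler])
still_asserted = ata.line_asserted

# Same stray interrupt with the unconditional read, then 1000 NIC interrupts.
for _ in range(1000):
    nic.irq_pending = True
    dispatch([nic.handler, ata.handler_always_clear])

print(claimed, still_asserted, ata.line_asserted, ata.status_reads)
```

The first case leaves the line asserted with nobody claiming it; the second clears it, at the price of one slow Status read per interrupt from the busy neighbour.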

-- 
tejun


Re: Possibly SATA related freeze killed networking and RAID

2007-11-29 Thread Robert Hancock

Phillip Susi wrote:
 Tejun Heo wrote:
  Agreed.  Nobody cared on ATA controllers is usually very effective at
  taking the whole machine down.  Is there any reason why we don't turn on
  irqpoll on turned off IRQs automatically?

 Why does a single spurious interrupt cause it to be shut down?  I can
 see if the interrupt is stuck on and keeps interrupting constantly, but
 if it's just the occasional spurious interrupt, why not just ignore it
 and move on?

I'm not certain offhand, but I think there may be such a threshold. 
However, an occasional spurious interrupt isn't likely. For a 
level-triggered interrupt, an unhandled interrupt will keep interrupting 
forever since nobody knows how to clear it (until we decide to disable 
the IRQ entirely).
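
The level-triggered behaviour can be illustrated with a small simulation. This is a sketch only: `Device`, `run_line` and the `give_up_after` limit are hypothetical names, and giving up after a run of consecutive unhandled deliveries is a simplification chosen for the example, not the kernel's actual rule.

```python
class Device:
    """Hypothetical device on a level-triggered interrupt line."""

    def __init__(self, asserting, driver_can_clear):
        self.asserting = asserting
        self.driver_can_clear = driver_can_clear

    def handler(self):
        # The driver can only deassert the line if it recognises the cause.
        if self.asserting and self.driver_can_clear:
            self.asserting = False
            return True
        return False


def run_line(devices, give_up_after):
    """Deliver the interrupt for as long as any device asserts the line.

    Returns (deliveries, disabled). 'disabled' models the IRQ being turned
    off after give_up_after consecutive unhandled deliveries.
    """
    deliveries = 0
    unhandled = 0
    while any(d.asserting for d in devices):
        deliveries += 1
        # Every handler on a shared line gets called on each delivery.
        handled = any([d.handler() for d in devices])
        unhandled = 0 if handled else unhandled + 1
        if unhandled >= give_up_after:
            return deliveries, True
    return deliveries, False


well_behaved = run_line([Device(True, True)], give_up_after=10)
stuck = run_line([Device(True, False)], give_up_after=10)
mixed = run_line([Device(True, True), Device(True, False)], give_up_after=10)
print(well_behaved, stuck, mixed)
```

A device whose driver can clear it is serviced in a single delivery. A device nobody knows how to clear keeps the line firing until the limit is reached, and it takes any well-behaved device sharing the line down with it.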


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: Possibly SATA related freeze killed networking and RAID

2007-11-27 Thread Tejun Heo
Pavel Machek wrote:
 Hi!
 
  kernel: [734344.717844] irq 21: nobody cared (try booting with the
 irqpoll option)
  kernel: [734344.717866]
 Your machine decided to emit interrupt 21 without an apparent reason.
 Whatever caused that made the kernel shut down IRQ 21 at which point the
 disk drives on that IRQ were no longer being serviced. Everything on IRQ
 21 would have died - which may be why your networking failed too.
 
 Hmm, perhaps that 'nobody cared' message should be worded more
 strongly, and printed at KERN_CRIT?

Agreed.  Nobody cared on ATA controllers is usually very effective at
taking the whole machine down.  Is there any reason why we don't turn on
irqpoll on turned off IRQs automatically?

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-ide in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Possibly SATA related freeze killed networking and RAID

2007-11-26 Thread Pavel Machek
Hi!

   kernel: [734344.717844] irq 21: nobody cared (try booting with the
  irqpoll option)
   kernel: [734344.717866]
 
 Your machine decided to emit interrupt 21 without an apparent reason.
 Whatever caused that made the kernel shut down IRQ 21 at which point the
 disk drives on that IRQ were no longer being serviced. Everything on IRQ
 21 would have died - which may be why your networking failed too.

Hmm, perhaps that 'nobody cared' message should be worded more
strongly, and printed at KERN_CRIT?
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: Possibly SATA related freeze killed networking and RAID

2007-11-21 Thread noah
2007/11/21, Alan Cox [EMAIL PROTECTED]:
  I've had other freezes before but this was the first time I was able
  to see what was actually going on.
  IRQ 21 appears to be shared between sata_nv and ethernet.
 
  Does this mean my hardware/BIOS is broken somehow?

 Not necessarily. It could be a bug in one of the drivers using IRQ 21
 (sata_nv or the nvidia ethernet), it could be another inactive device, or
 it could be a hardware funny.

How can I tell if there's an inactive device?

 Nvidia stuff can be quite hard to diagnose as we have no documentation
 but we can try. The first question is whether it is network or disk
 triggered - seeing if heavy loads to one or the other trigger the problem
 might be a first plan.

I haven't managed to trigger it again yet but at the time the CPU was
heavily loaded and I was re-indexing a database which caused a lot of
disk activity. I'm quite confident the network was pretty much idle at
the time.

  -- noah


Re: Possibly SATA related freeze killed networking and RAID

2007-11-20 Thread Alan Cox
  kernel: [734344.717844] irq 21: nobody cared (try booting with the
 irqpoll option)
  kernel: [734344.717866]

Your machine decided to emit interrupt 21 without an apparent reason.
Whatever caused that made the kernel shut down IRQ 21 at which point the
disk drives on that IRQ were no longer being serviced. Everything on IRQ
21 would have died - which may be why your networking failed too.
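
For anyone wanting to check saved logs for the same event, a minimal Python sketch follows. It assumes the message keeps the wording quoted in this thread ("irq N: nobody cared"); the function name `disabled_irqs` and the `SAMPLE` variable are made up for the example, and other kernel versions may word the message differently.

```python
import re

# Matches the wording quoted in this thread; an assumption for other kernels.
NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")


def disabled_irqs(log_text):
    """Return the sorted IRQ numbers reported as 'nobody cared'."""
    return sorted({int(m.group(1)) for m in NOBODY_CARED.finditer(log_text)})


SAMPLE = (
    "kernel: [734344.717844] irq 21: nobody cared "
    "(try booting with the irqpoll option)\n"
)
print(disabled_irqs(SAMPLE))
```

The IRQ numbers it returns can then be looked up in /proc/interrupts to see which drivers sit on the affected line.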

What do you have on IRQ 21 and is this a one off ?


Re: Possibly SATA related freeze killed networking and RAID

2007-11-20 Thread noah
2007/11/20, Alan Cox [EMAIL PROTECTED]:
   kernel: [734344.717844] irq 21: nobody cared (try booting with the
  irqpoll option)
   kernel: [734344.717866]

 Your machine decided to emit interrupt 21 without an apparent reason.
 Whatever caused that made the kernel shut down IRQ 21 at which point the
 disk drives on that IRQ were no longer being serviced. Everything on IRQ
 21 would have died - which may be why your networking failed too.

 What do you have on IRQ 21 and is this a one off ?

I've had other freezes before but this was the first time I was able
to see what was actually going on.
IRQ 21 appears to be shared between sata_nv and ethernet.

Does this mean my hardware/BIOS is broken somehow?
I'm running the latest BIOS available.

# cat /proc/interrupts
           CPU0       CPU1
  0:  264973603163   IO-APIC-edge  timer
  1:  0  2   IO-APIC-edge  i8042
  8:  0  0   IO-APIC-edge  rtc
 9:  0  0   IO-APIC-fasteoi   acpi
 12:  0  6   IO-APIC-edge  i8042
 16:   4851 669159   IO-APIC-fasteoi   shpchp, libata
 20:  0  0   IO-APIC-fasteoi   sata_nv
 21:  364434775430   IO-APIC-fasteoi   sata_nv, eth0
 22:  312614531218   IO-APIC-fasteoi   ohci_hcd:usb1, sata_nv
 23:  4   1649   IO-APIC-fasteoi   HDA Intel, ehci_hcd:usb2
NMI:  0  0
LOC:36295623629543
ERR:  0
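
Spotting shared lines in this kind of output can be automated. The Python sketch below assumes the column layout shown above (IRQ number, one count per CPU, controller type, then a comma-separated list of drivers); newer kernels add columns, so it would need adjusting there. The function name `shared_irqs` is hypothetical, and the counts in `SAMPLE` are zero placeholders, not the values from this machine.

```python
def shared_irqs(text):
    """Map IRQ number -> driver names, for lines with more than one driver."""
    lines = [line for line in text.splitlines() if line.strip()]
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue                      # skip NMI, LOC, ERR rows
        # ncpu counts, then the controller type, then the driver list.
        fields = rest.split(None, ncpu + 1)
        if len(fields) < ncpu + 2:
            continue
        names = [name.strip() for name in fields[ncpu + 1].split(",")]
        if len(names) > 1:
            shared[int(label)] = names
    return shared


# Driver names follow the listing above; counts are placeholders.
SAMPLE = """\
           CPU0       CPU1
  0:          0          0   IO-APIC-edge      timer
 20:          0          0   IO-APIC-fasteoi   sata_nv
 21:          0          0   IO-APIC-fasteoi   sata_nv, eth0
 22:          0          0   IO-APIC-fasteoi   ohci_hcd:usb1, sata_nv
 23:          0          0   IO-APIC-fasteoi   HDA Intel, ehci_hcd:usb2
NMI:          0          0
ERR:          0
"""

for irq, names in sorted(shared_irqs(SAMPLE).items()):
    print(irq, names)
```

On a live system the text could be read from /proc/interrupts instead of the embedded sample.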

  -- noah
-
To unsubscribe from this list: send the line unsubscribe linux-ide in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Possibly SATA related freeze killed networking and RAID

2007-11-20 Thread Alan Cox
 I've had other freezes before but this was the first time I was able
 to see what was actually going on.
 IRQ 21 appears to be shared between sata_nv and ethernet.
 
 Does this mean my hardware/BIOS is broken somehow?

Not necessarily. It could be a bug in one of the drivers using IRQ 21
(sata_nv or the nvidia ethernet), it could be another inactive device, or
it could be a hardware funny.

Nvidia stuff can be quite hard to diagnose as we have no documentation
but we can try. The first question is whether it is network or disk
triggered - seeing if heavy loads to one or the other trigger the problem
might be a first plan.


Alan