Re: Possibly SATA related freeze killed networking and RAID
2007/11/21, noah [EMAIL PROTECTED]:
> 2007/11/21, Alan Cox [EMAIL PROTECTED]:
> > > I've had other freezes before but this was the first time I was able to see what was actually going on. IRQ 21 appears to be shared between sata_nv and ethernet. Does this mean my hardware/BIOS is broken somehow?
> >
> > Not necessarily. It could be a bug in one of the drivers using IRQ 21 (sata_nv or the nvidia ethernet), it could be another inactive device, or it could be a hardware funny.
>
> How can I tell if there's an inactive device?
>
> > Nvidia stuff can be quite hard to diagnose as we have no documentation but we can try. The first question is whether it is network or disk triggered - seeing if heavy loads to one or the other trigger the problem might be a first plan.
>
> I haven't managed to trigger it again yet but at the time the CPU was heavily loaded and I was re-indexing a database which caused a lot of disk activity. I'm quite confident the network was pretty much idle at the time.

The same thing has happened twice now, both during the weekly check of the md0 and md1 RAID1 arrays. That is, networking on the primary interface is dead. Its interrupt (IRQ 21) is shared between sata_nv and forcedeth.

Is there anything I can do to debug this problem? I don't have access to the logs right now but will have later.

--
noah

- To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Possibly SATA related freeze killed networking and RAID
Tejun Heo wrote:
> Surprise, surprise. There's no way to tell whether the controller raised an interrupt or not if a command is not in progress. As I said before, there's no IRQ pending bit. While processing commands, you can tell by looking at other status registers, but when there's nothing in flight and the controller determines it's a good time to raise a spurious interrupt, there's no way you can tell. That dang SFF interface is like 15+ years old. But we can still make things pretty robust. We're working on it. Thanks.

It sounds like you mean that you know the controller did NOT raise the interrupt (intentionally/correctly) if there was no command in progress, as opposed to not being able to tell. Unless there is some condition under which it is valid for the controller to raise an interrupt when it had no commands in progress? And if that's the case and there's no way to know WHY, that's a broken design.
Re: Possibly SATA related freeze killed networking and RAID
Phillip Susi wrote:
> Tejun Heo wrote:
> > Surprise, surprise. There's no way to tell whether the controller raised an interrupt or not if a command is not in progress. As I said before, there's no IRQ pending bit. While processing commands, you can tell by looking at other status registers, but when there's nothing in flight and the controller determines it's a good time to raise a spurious interrupt, there's no way you can tell. That dang SFF interface is like 15+ years old. But we can still make things pretty robust. We're working on it.
>
> It sounds like you mean that you know the controller did NOT raise the interrupt (intentionally/correctly) if there was no command in progress, as opposed to not being able to tell. Unless there is some condition under which it is valid for the controller to raise an interrupt when it had no commands in progress? And if that's the case and there's no way to know WHY, that's a broken design.

If everything works correctly, all interrupts can be accounted for. It's just that there's no margin for erratic behavior and most ATA controllers are built really cheap. So, yeah, it's a 15+ year old, half-broken design.

--
tejun
Re: Possibly SATA related freeze killed networking and RAID
On Fri 2007-11-30 10:00:55, Mark Lord wrote:
> Pavel Machek wrote:
> > On Fri 2007-11-30 13:13:44, Alan Cox wrote:
> > > > Why does a single spurious interrupt cause it to be shut down? I can
> > >
> > > It doesn't.
> > >
> > > > see if the interrupt is stuck on and keeps interrupting constantly, but if it's just the occasional spurious interrupt, why not just ignore it and move on?
> > >
> > > The interrupt is usually level triggered so it continues to create interrupts until you silence it. The thresholds are about 10,000 interrupt events and on newer kernels we also reset the count if we don't see any for a while. That works for most stuff except the thinkpad bluetooth problem.
> >
> > Which is confirmed hw problem now, btw.
>
> ...
>
> What problem is that, exactly?

Spurious interrupt, interrupt link is disabled after ~15 minutes. It seems pretty unique to t61.

> My Dell has an internal USB BT adapter that briefly appears and then disappears again on resume (or stays if I have enabled it via the BIOS key). I wonder if that has anything to do with the (new in) 2.6.23 pauses that machine has on resume (about every 10th time).

No idea, but t61 problem seems different.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: Possibly SATA related freeze killed networking and RAID
Phillip Susi wrote:
> Tejun Heo wrote:
> > Because SFF ATA controllers don't have an IRQ pending bit. You don't know whether an IRQ is raised or not. Plus, accessing the Status register, which clears a pending IRQ, can be very slow on PATA machines. It has to go through the PCI and ATA bus and come back. So, unconditionally trying to clear the IRQ by accessing Status can incur noticeable overhead if the IRQ is shared with devices which raise a lot of IRQs.
>
> There HAS to be a way to determine if that device generated the interrupt, or the interrupt cannot be shared. Since the kernel said nobody cared about the interrupt, that indicates that the sata driver checked the status register and realized the sata chip didn't generate the interrupt, and returned to the kernel letting it know that the interrupt was not for it.

Surprise, surprise. There's no way to tell whether the controller raised an interrupt or not if a command is not in progress. As I said before, there's no IRQ pending bit. While processing commands, you can tell by looking at other status registers, but when there's nothing in flight and the controller determines it's a good time to raise a spurious interrupt, there's no way you can tell. That dang SFF interface is like 15+ years old. But we can still make things pretty robust. We're working on it.

Thanks.

--
tejun
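The missing IRQ pending bit can be illustrated with a toy model. The sketch below is a simulation in Python, not driver code: `NicModel`, `SffAtaModel` and `dispatch` are hypothetical names made up for this illustration, and only the `IRQ_HANDLED` / `IRQ_NONE` names mirror the values real Linux interrupt handlers return. It assumes, as described above, that the SFF-style handler can only claim an interrupt while a command is in flight.

```python
from dataclasses import dataclass

HANDLED, NONE = "IRQ_HANDLED", "IRQ_NONE"


@dataclass
class NicModel:
    """Hypothetical device that exposes a readable interrupt-pending bit."""
    pending: bool = False

    def handler(self):
        if not self.pending:
            return NONE
        self.pending = False
        return HANDLED


@dataclass
class SffAtaModel:
    """Hypothetical SFF-style controller: no pending bit is visible."""
    command_in_flight: bool = False
    line_asserted: bool = False  # real state of the IRQ line, hidden from the driver

    def handler(self):
        if not self.command_in_flight:
            return NONE  # idle: the driver cannot tell whether it raised the line
        self.command_in_flight = False
        self.line_asserted = False  # servicing the command clears the IRQ
        return HANDLED


def dispatch(handlers):
    """Call every handler on the shared line; True if any of them claimed it."""
    results = [h() for h in handlers]
    return HANDLED in results


nic, ata = NicModel(), SffAtaModel()

# Normal completion: a command is in flight, so the ATA handler claims the IRQ.
ata.command_in_flight = ata.line_asserted = True
completion_claimed = dispatch([ata.handler, nic.handler])

# Stray interrupt while idle: neither handler can claim it ("nobody cared").
ata.line_asserted = True
stray_claimed = dispatch([ata.handler, nic.handler])

print("completion claimed:", completion_claimed)
print("stray claimed:", stray_claimed, "| line still asserted:", ata.line_asserted)
```

In the second case the line stays asserted after every handler has run, which is the situation a level-triggered shared IRQ cannot recover from on its own.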
Re: Possibly SATA related freeze killed networking and RAID
Pavel Machek wrote:
> On Fri 2007-11-30 13:13:44, Alan Cox wrote:
> > > Why does a single spurious interrupt cause it to be shut down? I can
> >
> > It doesn't.
> >
> > > see if the interrupt is stuck on and keeps interrupting constantly, but if it's just the occasional spurious interrupt, why not just ignore it and move on?
> >
> > The interrupt is usually level triggered so it continues to create interrupts until you silence it. The thresholds are about 10,000 interrupt events and on newer kernels we also reset the count if we don't see any for a while. That works for most stuff except the thinkpad bluetooth problem.
>
> Which is confirmed hw problem now, btw.

...

What problem is that, exactly?

My Dell has an internal USB BT adapter that briefly appears and then disappears again on resume (or stays if I have enabled it via the BIOS key). I wonder if that has anything to do with the (new in) 2.6.23 pauses that machine has on resume (about every 10th time).

Cheers
Re: Possibly SATA related freeze killed networking and RAID
On Fri 2007-11-30 13:13:44, Alan Cox wrote:
> > Why does a single spurious interrupt cause it to be shut down? I can
>
> It doesn't.
>
> > see if the interrupt is stuck on and keeps interrupting constantly, but if it's just the occasional spurious interrupt, why not just ignore it and move on?
>
> The interrupt is usually level triggered so it continues to create interrupts until you silence it. The thresholds are about 10,000 interrupt events and on newer kernels we also reset the count if we don't see any for a while. That works for most stuff except the thinkpad bluetooth problem.

Which is confirmed hw problem now, btw.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: Possibly SATA related freeze killed networking and RAID
> Why does a single spurious interrupt cause it to be shut down? I can

It doesn't.

> see if the interrupt is stuck on and keeps interrupting constantly, but if it's just the occasional spurious interrupt, why not just ignore it and move on?

The interrupt is usually level triggered so it continues to create interrupts until you silence it. The thresholds are about 10,000 interrupt events and on newer kernels we also reset the count if we don't see any for a while. That works for most stuff except the thinkpad bluetooth problem.
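The threshold-plus-reset behaviour described above can be sketched as a small model. This is a toy simulation in Python, not the kernel's actual code: `UnhandledIrqMonitor` is a hypothetical name, the threshold of 10,000 is taken from the "about 10,000" figure in the message, and the 0.1 second quiet period is an assumed stand-in for "a while".

```python
class UnhandledIrqMonitor:
    """Toy model: disable an IRQ line after too many unhandled interrupts."""

    def __init__(self, threshold=10_000, quiet_reset_s=0.1):
        self.threshold = threshold          # "about 10,000" per the message above
        self.quiet_reset_s = quiet_reset_s  # assumed value for "a while"
        self.unhandled = 0
        self.last_unhandled = None
        self.disabled = False

    def note(self, handled, now):
        """Record one interrupt at time `now` (seconds); return True once disabled."""
        if self.disabled or handled:
            return self.disabled
        quiet = self.last_unhandled is not None and (
            now - self.last_unhandled > self.quiet_reset_s
        )
        if quiet:
            self.unhandled = 0  # the line went quiet: forget older strays
        self.last_unhandled = now
        self.unhandled += 1
        if self.unhandled >= self.threshold:
            self.disabled = True  # the point where "nobody cared" would be reported
        return self.disabled


# A stuck level-triggered line: the interrupt refires immediately every time.
stuck = UnhandledIrqMonitor()
t = 0.0
while not stuck.note(handled=False, now=t):
    t += 1e-6
print("stuck line disabled after", stuck.unhandled, "unhandled interrupts")

# Occasional strays, one per second: the count keeps resetting.
sparse = UnhandledIrqMonitor()
for second in range(20_000):
    sparse.note(handled=False, now=float(second))
print("sparse strays disabled:", sparse.disabled)
```

Under these assumptions a single stray interrupt never disables the line; only a line that keeps firing without being serviced reaches the threshold.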
Re: Possibly SATA related freeze killed networking and RAID
Tejun Heo wrote:
> Because SFF ATA controllers don't have an IRQ pending bit. You don't know whether an IRQ is raised or not. Plus, accessing the Status register, which clears a pending IRQ, can be very slow on PATA machines. It has to go through the PCI and ATA bus and come back. So, unconditionally trying to clear the IRQ by accessing Status can incur noticeable overhead if the IRQ is shared with devices which raise a lot of IRQs.

There HAS to be a way to determine if that device generated the interrupt, or the interrupt cannot be shared. Since the kernel said nobody cared about the interrupt, that indicates that the sata driver checked the status register and realized the sata chip didn't generate the interrupt, and returned to the kernel letting it know that the interrupt was not for it.
Re: Possibly SATA related freeze killed networking and RAID
Tejun Heo wrote:
> Agreed. "Nobody cared" on ATA controllers is usually very effective at taking the whole machine down. Is there any reason why we don't turn on irqpoll on turned-off IRQs automatically?

Why does a single spurious interrupt cause it to be shut down? I can see if the interrupt is stuck on and keeps interrupting constantly, but if it's just the occasional spurious interrupt, why not just ignore it and move on?
Re: Possibly SATA related freeze killed networking and RAID
Phillip Susi wrote:
> Tejun Heo wrote:
> > Agreed. "Nobody cared" on ATA controllers is usually very effective at taking the whole machine down. Is there any reason why we don't turn on irqpoll on turned-off IRQs automatically?
>
> Why does a single spurious interrupt cause it to be shut down? I can see if the interrupt is stuck on and keeps interrupting constantly, but if it's just the occasional spurious interrupt, why not just ignore it and move on?

Because SFF ATA controllers don't have an IRQ pending bit. You don't know whether an IRQ is raised or not. Plus, accessing the Status register, which clears a pending IRQ, can be very slow on PATA machines. It has to go through the PCI and ATA bus and come back. So, unconditionally trying to clear the IRQ by accessing Status can incur noticeable overhead if the IRQ is shared with devices which raise a lot of IRQs.

--
tejun
Re: Possibly SATA related freeze killed networking and RAID
Phillip Susi wrote:
> Tejun Heo wrote:
> > Agreed. "Nobody cared" on ATA controllers is usually very effective at taking the whole machine down. Is there any reason why we don't turn on irqpoll on turned-off IRQs automatically?
>
> Why does a single spurious interrupt cause it to be shut down? I can see if the interrupt is stuck on and keeps interrupting constantly, but if it's just the occasional spurious interrupt, why not just ignore it and move on?

I'm not certain offhand, but I think there may be such a threshold. However, an occasional spurious interrupt isn't likely. For a level-triggered interrupt, an unhandled interrupt will keep interrupting forever since nobody knows how to clear it (until we decide to disable the IRQ entirely).

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: Possibly SATA related freeze killed networking and RAID
Pavel Machek wrote:
> Hi!
>
> > > kernel: [734344.717844] irq 21: nobody cared (try booting with the irqpoll option)
> > > kernel: [734344.717866]
> >
> > Your machine decided to emit interrupt 21 without an apparent reason. Whatever caused that made the kernel shut down IRQ 21, at which point the disk drives on that IRQ were no longer being serviced. Everything on IRQ 21 would have died - which may be why your networking failed too.
>
> Hmm, perhaps that 'nobody cared' message should be worded more strongly, and printed at KERN_CRIT?

Agreed. "Nobody cared" on ATA controllers is usually very effective at taking the whole machine down. Is there any reason why we don't turn on irqpoll on turned-off IRQs automatically?

Thanks.

--
tejun
Re: Possibly SATA related freeze killed networking and RAID
Hi!

> > kernel: [734344.717844] irq 21: nobody cared (try booting with the irqpoll option)
> > kernel: [734344.717866]
>
> Your machine decided to emit interrupt 21 without an apparent reason. Whatever caused that made the kernel shut down IRQ 21, at which point the disk drives on that IRQ were no longer being serviced. Everything on IRQ 21 would have died - which may be why your networking failed too.

Hmm, perhaps that 'nobody cared' message should be worded more strongly, and printed at KERN_CRIT?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: Possibly SATA related freeze killed networking and RAID
2007/11/21, Alan Cox [EMAIL PROTECTED]:
> > I've had other freezes before but this was the first time I was able to see what was actually going on. IRQ 21 appears to be shared between sata_nv and ethernet. Does this mean my hardware/BIOS is broken somehow?
>
> Not necessarily. It could be a bug in one of the drivers using IRQ 21 (sata_nv or the nvidia ethernet), it could be another inactive device, or it could be a hardware funny.

How can I tell if there's an inactive device?

> Nvidia stuff can be quite hard to diagnose as we have no documentation but we can try. The first question is whether it is network or disk triggered - seeing if heavy loads to one or the other trigger the problem might be a first plan.

I haven't managed to trigger it again yet but at the time the CPU was heavily loaded and I was re-indexing a database which caused a lot of disk activity. I'm quite confident the network was pretty much idle at the time.

--
noah
Re: Possibly SATA related freeze killed networking and RAID
> kernel: [734344.717844] irq 21: nobody cared (try booting with the irqpoll option)
> kernel: [734344.717866]

Your machine decided to emit interrupt 21 without an apparent reason. Whatever caused that made the kernel shut down IRQ 21, at which point the disk drives on that IRQ were no longer being serviced. Everything on IRQ 21 would have died - which may be why your networking failed too.

What do you have on IRQ 21, and is this a one-off?
Re: Possibly SATA related freeze killed networking and RAID
2007/11/20, Alan Cox [EMAIL PROTECTED]:
> > kernel: [734344.717844] irq 21: nobody cared (try booting with the irqpoll option)
> > kernel: [734344.717866]
>
> Your machine decided to emit interrupt 21 without an apparent reason. Whatever caused that made the kernel shut down IRQ 21, at which point the disk drives on that IRQ were no longer being serviced. Everything on IRQ 21 would have died - which may be why your networking failed too.
>
> What do you have on IRQ 21, and is this a one-off?

I've had other freezes before but this was the first time I was able to see what was actually going on. IRQ 21 appears to be shared between sata_nv and ethernet. Does this mean my hardware/BIOS is broken somehow? I'm running the latest BIOS available.

# cat /proc/interrupts
           CPU0       CPU1
  0:          264973603163   IO-APIC-edge      timer
  1:          0          2   IO-APIC-edge      i8042
  8:          0          0   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          0          6   IO-APIC-edge      i8042
 16:       4851     669159   IO-APIC-fasteoi   shpchp, libata
 20:          0          0   IO-APIC-fasteoi   sata_nv
 21:          364434775430   IO-APIC-fasteoi   sata_nv, eth0
 22:          312614531218   IO-APIC-fasteoi   ohci_hcd:usb1, sata_nv
 23:          4       1649   IO-APIC-fasteoi   HDA Intel, ehci_hcd:usb2
NMI:          0          0
LOC:        36295623629543
ERR:          0

--
noah
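For picking out shared lines from a listing like the one above, a short script can help. The following is a minimal sketch in Python, assuming the usual `/proc/interrupts` layout of IRQ number, per-CPU counters, controller type, then a comma-separated handler list. `shared_irqs` is a hypothetical helper name, and the sample rows are copied from the message above as archived, including the rows whose two counters have run together.

```python
def shared_irqs(interrupts_text):
    """Return {irq: [handler, ...]} for IRQ lines with more than one handler."""
    shared = {}
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip the CPU header and the NMI/LOC/ERR summary rows
        fields = rest.split()
        # Drop the per-CPU counters; what follows is the controller column.
        while fields and fields[0].isdigit():
            fields.pop(0)
        names = " ".join(fields[1:]).split(",")
        handlers = [n.strip() for n in names if n.strip()]
        if len(handlers) > 1:
            shared[int(label)] = handlers
    return shared


SAMPLE = """\
           CPU0       CPU1
 16:       4851     669159   IO-APIC-fasteoi   shpchp, libata
 20:          0          0   IO-APIC-fasteoi   sata_nv
 21: 364434775430   IO-APIC-fasteoi   sata_nv, eth0
 22: 312614531218   IO-APIC-fasteoi   ohci_hcd:usb1, sata_nv
 23:          4       1649   IO-APIC-fasteoi   HDA Intel, ehci_hcd:usb2
"""

for irq, handlers in sorted(shared_irqs(SAMPLE).items()):
    print(irq, handlers)
```

On a live system the same function could be fed `open("/proc/interrupts").read()` instead of the sample text.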
Re: Possibly SATA related freeze killed networking and RAID
> I've had other freezes before but this was the first time I was able to see what was actually going on. IRQ 21 appears to be shared between sata_nv and ethernet. Does this mean my hardware/BIOS is broken somehow?

Not necessarily. It could be a bug in one of the drivers using IRQ 21 (sata_nv or the nvidia ethernet), it could be another inactive device, or it could be a hardware funny.

Nvidia stuff can be quite hard to diagnose as we have no documentation but we can try. The first question is whether it is network or disk triggered - seeing if heavy loads to one or the other trigger the problem might be a first plan.

Alan