Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
Thank you for the clarification Bjorn! I was on vacation..sorry for my delay. Closing the loop here, I understand we're not getting this patch merged (due to its restriction to domain 0) and there was a suggestion in the thread of trying to block MSIs from the IOMMU init code (which also have the restriction of only working in iommu-enabled systems). Hope the thread is helpful and if somebody faces this issue, can comment here and at least find this approach, maybe test the patch. Thanks to all involved!
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On Wed, Nov 18, 2020 at 07:36:08PM -0300, Guilherme Piccoli wrote: > Thanks a lot Bjorn! I confess except for PPC64 Server machines, I > never saw other "domains" or segments. Is it common in x86 to have > that? The early_quirks() are restricted to the first segment, no > matter how many host bridges we have in segment ? I don't know whether it's *common* to have multiple domains on x86, but they're definitely used on large systems. This includes some lspci info from an HPE Superdome Flex system: https://lore.kernel.org/lkml/5350e02a-6457-41a8-8f33-af67bfdae...@fb.com/ The early quirks in arch/x86/kernel/early-quirks.c (not the DECLARE_PCI_FIXUP_EARLY quirks in drivers/pci/quirks.c) are restricted to segment 0, no matter how many host bridges there are. This is because they use read_pci_config_16(), which uses a config access mechanism that has no provision for a segment number. Bjorn
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
Thanks a lot Bjorn! I confess except for PPC64 Server machines, I never saw other "domains" or segments. Is it common in x86 to have that? The early_quirks() are restricted to the first segment, no matter how many host bridges we have in segment ? Thanks again!
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On Tue, Nov 17, 2020 at 09:04:07AM -0300, Guilherme Piccoli wrote: > Also, taking here the opportunity to clarify my understanding about > the limitations of that approach: Bjorn, in our reproducer machine we > had 3 parents in the PCI tree (as per lspci -t), :00, :ff and > :80 - are those all under "segment 0" as per your verbiage? Yes. The "" is the PCI segment (or "domain" in the Linux PCI core). It's common on x86 to have multiple host bridges in segment .
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
Thomas Gleixner writes: > On Tue, Nov 17 2020 at 12:19, David Woodhouse wrote: >> On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote: >>> But that does not solve the problem either simply because then the IOMMU >>> will catch the rogue MSIs and you get an interrupt storm on the IOMMU >>> error interrupt. >> >> Not if you can tell the IOMMU to stop reporting those errors. >> >> We can easily do it per-device for DMA errors; not quite sure what >> granularity we have for interrupts. Perhaps the Intel IOMMU only lets >> you set the Fault Processing Disable bit per IRTE entry, and you still >> get faults for Compatibility Format interrupts? Not sure about AMD... > > It looks like the fault (DMAR) and event (AMD) interrupts can be > disabled in the IOMMU. That might help to bridge the gap until the PCI > bus is scanned in full glory and the devices can be shut up for real. > > If we make this conditional for a crash dump kernel that might do the > trick. > > Lot's of _might_ there :) Worth testing. Folks tracking this down is this enough of a hint for you to write a patch and test? Also worth checking how close irqpoll is to handling a case like this. At least historically it did a pretty good job at shutting down problem interrupts. I really find it weird that an edge triggered irq was firing fast enough to stop a system from booting. Level triggered irqs do that if they are acknolwedged without actually being handled. I think edge triggered irqs only fire when another event comes in, and it seems odd to get so many actual events causing interrupts that the system soft locks up. Is my memory of that situation confused? I recommend making these facilities general debug facilities so that they can be used for cases other than crash dump. The crash dump kernel would always enable them because it can assume that the hardware is very likely in a wonky state. Eric
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On Tue, Nov 17 2020 at 12:19, David Woodhouse wrote: > On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote: >> But that does not solve the problem either simply because then the IOMMU >> will catch the rogue MSIs and you get an interrupt storm on the IOMMU >> error interrupt. > > Not if you can tell the IOMMU to stop reporting those errors. > > We can easily do it per-device for DMA errors; not quite sure what > granularity we have for interrupts. Perhaps the Intel IOMMU only lets > you set the Fault Processing Disable bit per IRTE entry, and you still > get faults for Compatibility Format interrupts? Not sure about AMD... It looks like the fault (DMAR) and event (AMD) interrupts can be disabled in the IOMMU. That might help to bridge the gap until the PCI bus is scanned in full glory and the devices can be shut up for real. If we make this conditional for a crash dump kernel that might do the trick. Lot's of _might_ there :) Thanks tglx
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote: > But that does not solve the problem either simply because then the IOMMU > will catch the rogue MSIs and you get an interrupt storm on the IOMMU > error interrupt. Not if you can tell the IOMMU to stop reporting those errors. We can easily do it per-device for DMA errors; not quite sure what granularity we have for interrupts. Perhaps the Intel IOMMU only lets you set the Fault Processing Disable bit per IRTE entry, and you still get faults for Compatibility Format interrupts? Not sure about AMD... smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On Mon, Nov 16, 2020 at 10:07 PM Eric W. Biederman wrote: > [...] > > I think we need to disable MSIs in the crashing kernel before the > > kexec. It adds a little more code in the crash_kexec() path, but it > > seems like a worthwhile tradeoff. > > Disabling MSIs in the b0rken kernel is not possible. > > Walking the device tree or even a significant subset of it hugely > decreases the chances that we will run into something that is incorrect > in the known broken kernel. I expect the code to do that would double > or triple the amount of code that must be executed in the known broken > kernel. The last time something like that happened (switching from xchg > to ordinary locks) we had cases that stopped working. Walking all of > the pci devices in the system is much more invasive. > I think we could try to decouple this problem in 2, if that makes sense. Bjorn/others can jump in and correct me if I'm wrong. So, the problem is to walk a PCI topology tree, identify every device and disable MSI(/INTx maybe) in them, and the big deal with doing that in the broken kernel before the kexec is that this task is complex due to the tree walk, mainly. But what if we keep a table (a simple 2-D array) with the address and data to be written to disable the MSIs, and before the kexec we could have a parameter enabling a function that just goes through this array and performs the writes? The table itself would be constructed by the PCI core (and updated in the hotplug path), when device discovery is happening. This table would live in a protected area in memory, with no write/execute access, this way if the kernel itself tries to corrupt that, we get a fault (yeah, I know DMAs can corrupt anywhere, but IOMMU could protect against that). If the parameter "kdump_clear_msi" is passed in the cmdline of the regular kernel, for example, then the function walks this simple table and performs the devices' writes before the kexec... If that's not possible or a bad idea for any reason, I still think the early_quirks() idea hereby proposed is not something we should discard, because it'll help a lot of users even with its limitations (in our case it worked very well). Also, taking here the opportunity to clarify my understanding about the limitations of that approach: Bjorn, in our reproducer machine we had 3 parents in the PCI tree (as per lspci -t), :00, :ff and :80 - are those all under "segment 0" as per your verbiage? In our case the troublemaker was under :80, and the early_quirks() shut the device up successfully. Thanks, Guilherme
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On Mon, Nov 16 2020 at 19:06, Eric W. Biederman wrote: > Bjorn Helgaas writes: > My two top candidates are poking the IOMMUs early to shut things off, > and figuring out if we can delay enabling interrupts until we have > initialized pci. Keeping interrupts disabled post PCI initialization would be nice, but that requires feeding the whole init machinery through a shredder and collecting the bits and pieces. > Poking at IOMMUs early should work for most systems with ``enterprise'' > hardware. Systems where people care about kdump the most. The IOMMU IRQ remapping part _is_ initialized early and _before_ interrupts are enabled. But that does not solve the problem either simply because then the IOMMU will catch the rogue MSIs and you get an interrupt storm on the IOMMU error interrupt. This one is not going to be turned off because the IOMMU error interrupt handler will handle each of them and tell the core code that everything is under control. As I explained several times now, the only way to shut this up reliably is at the source. Curing the symptom is almost never a good solution. Thanks, tglx
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
Bjorn Helgaas writes: > I don't think passing the device information to the kdump kernel is > really practical. The kdump kernel would use it to do PCI config > writes to disable MSIs before enabling IRQs, and it doesn't know how > to access config space that early. I don't think it is particularly practical either. But in practice on x86 it is either mmio writes or 0xcf8 style writes and we could pass a magic table that would have all of that information. > We could invent special "early config access" things, but that gets > really complicated really fast. Config access depends on ACPI MCFG > tables, firmware interfaces, and in many cases, on the native host > bridge drivers in drivers/pci/controllers/. I do agree that the practical problem with passing information early is that gets us into the weeds and creates code that we only care about in the case of kexec-on-panic. It is much better to make the existing code more robust, so that we reduce our dependency on firmware doing the right thing. > I think we need to disable MSIs in the crashing kernel before the > kexec. It adds a little more code in the crash_kexec() path, but it > seems like a worthwhile tradeoff. Disabling MSIs in the b0rken kernel is not possible. Walking the device tree or even a significant subset of it hugely decreases the chances that we will run into something that is incorrect in the known broken kernel. I expect the code to do that would double or triple the amount of code that must be executed in the known broken kernel. The last time something like that happened (switching from xchg to ordinary locks) we had cases that stopped working. Walking all of the pci devices in the system is much more invasive. That is not to downplay the problems of figuring out how to disable things in early boot. My two top candidates are poking the IOMMUs early to shut things off, and figuring out if we can delay enabling interrupts until we have initialized pci. Poking at IOMMUs early should work for most systems with ``enterprise'' hardware. Systems where people care about kdump the most. Eric
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On Mon, Nov 16, 2020 at 05:31:36PM -0300, Guilherme G. Piccoli wrote: > First of all, thanks everybody for the great insights/discussion! This > thread ended-up being a great learning for (at least) me. > > Given the big amount of replies and intermixed comments, I wouldn't be > able to respond inline to all, so I'll try another approach below. > > > From Bjorn: > "I think [0] proposes using early_quirks() to disable MSIs at boot-time. > That doesn't seem like a robust solution because (a) the problem affects > all arches but early_quirks() is x86-specific and (b) even on x86 > early_quirks() only works for PCI segment 0 because it relies on the > 0xCF8/0xCFC I/O ports." > > Ah. I wasn't aware of that limitation, I thought enhancing the > early_quirks() search to go through all buses would fix that, thanks for > the clarification! And again, worth to clarify that this is not a > problem affecting all arches _in practice_ - PowerPC for example has the > FW primitives allowing a powerful PCI controller (out-of-band) reset, > preventing this kind of issue usually. > > [0] > https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com > > > From Bjorn: > "A crash_device_shutdown() could do something at the host bridge level > if that's possible, or reset/disable bus mastering/disable MSI/etc on > individual PCI devices if necessary." > > From Lukas: > "Guilherme's original patches from 2018 iterate over all 256 PCI buses. > That might impact boot time negatively. The reason he has to do that is > because the crashing kernel doesn't know which devices exist and which > have interrupts enabled. However the crashing kernel has that > information. It should either disable interrupts itself or pass the > necessary information to the crashing kernel as setup_data or whatever. I don't think passing the device information to the kdump kernel is really practical. The kdump kernel would use it to do PCI config writes to disable MSIs before enabling IRQs, and it doesn't know how to access config space that early. We could invent special "early config access" things, but that gets really complicated really fast. Config access depends on ACPI MCFG tables, firmware interfaces, and in many cases, on the native host bridge drivers in drivers/pci/controllers/. I think we need to disable MSIs in the crashing kernel before the kexec. It adds a little more code in the crash_kexec() path, but it seems like a worthwhile tradeoff. Bjorn
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On Mon, Nov 16, 2020 at 6:45 PM Eric W. Biederman wrote: > The way to do that would be to collect the set of pci devices when the > kexec on panic kernel is loaded, not during crash_kexec. If someone > performs a device hotplug they would need to reload the kexec on panic > kernel. > > I am not necessarily endorsing that just pointing out how it can be > done. > > Eric Thanks Eric, I agree! I think if we use something like PKRAM (a more dynamic approach) we could have the PCI hotplug path updating the data to-be-passed to the crash kernel, so the crash kernel doesn't even need to be loaded again.
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
"Guilherme G. Piccoli" writes: > First of all, thanks everybody for the great insights/discussion! This > thread ended-up being a great learning for (at least) me. > > Given the big amount of replies and intermixed comments, I wouldn't be > able to respond inline to all, so I'll try another approach below. > > > From Bjorn: > "I think [0] proposes using early_quirks() to disable MSIs at boot-time. > That doesn't seem like a robust solution because (a) the problem affects > all arches but early_quirks() is x86-specific and (b) even on x86 > early_quirks() only works for PCI segment 0 because it relies on the > 0xCF8/0xCFC I/O ports." > > Ah. I wasn't aware of that limitation, I thought enhancing the > early_quirks() search to go through all buses would fix that, thanks for > the clarification! And again, worth to clarify that this is not a > problem affecting all arches _in practice_ - PowerPC for example has the > FW primitives allowing a powerful PCI controller (out-of-band) reset, > preventing this kind of issue usually. > > [0] > https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com > > > From Bjorn: > "A crash_device_shutdown() could do something at the host bridge level > if that's possible, or reset/disable bus mastering/disable MSI/etc on > individual PCI devices if necessary." > > From Lukas: > "Guilherme's original patches from 2018 iterate over all 256 PCI buses. > That might impact boot time negatively. The reason he has to do that is > because the crashing kernel doesn't know which devices exist and which > have interrupts enabled. However the crashing kernel has that > information. It should either disable interrupts itself or pass the > necessary information to the crashing kernel as setup_data or whatever. > > Guilherme's patches add a "clearmsi" command line parameter. I guess > the idea is that the parameter is always passed to the crash kernel but > the patches don't do that, which seems odd." > > Very good points Lukas, thanks! The reason of not adding the clearmsi > thing as a default kdump procedure is kinda related to your first point: > it impacts a bit boot-time, also it's an additional logic in the kdump > kernel, which we know is (sometimes) the last resort in gathering > additional data to debug a potentially complex issue. That said, I'd not > like to enforce this kind of "policy" to everybody, so my implementation > relies on having it as a parameter, and the kdump userspace counter-part > could then have a choice in adding or not such mechanism to the kdump > kernel parameter list. > > About passing the data to next kernel, this is very interesting, we > could do something like that either through setup_data (as you said) or > using a new proposal, the PKRAM thing [a]. > This way we wouldn't need a crash_device_shutdown(), but instead when > kernel is loading a crash kernel (through kexec -p) we can collect all > the necessary information that'll be passed to the crash kernel > (obviously that if we are collecting PCI topology information, we need > callbacks in the PCI hotplug add/del path to update this information). > > [a] > https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yzn...@oracle.com/ > > Below, inline reply to Eric's last message. > > > On 15/11/2020 17:46, Eric W. Biederman wrote: >>> [...] >>> An MSI interrupt is a (DMA) write to the local APIC base address >>> 0xfeex which has the target CPU and control bits encoded in the >>> lower bits. The written data is the vector and more control bits. >>> >>> The only way to stop that is to shut it up at the PCI device level. >> >> Or to write to magic chipset registers that will stop transforming DMA >> writes to 0xfeex into x86 interrupts. With an IOMMU I know x86 has >> such registers (because the point of the IOMMU is to limit the problems >> rogue devices can cause). Without an IOMMU I don't know if x86 has any >> such registers. I remember that other platforms have an interrupt >> controller that does provide some level of control. That x86 does not >> is what makes this an x86 specific problem. >> [...] >> Looking at patch 3/3 what this patchset does is an early disable of >> of the msi registers. Which is mostly reasonable. Especially as has >> been pointed out the only location the x86 vector and x86 cpu can >> be found is in the msi configuration registers. >> >> That also seems reasonable. But Bjorn's concern about not finding all >> devices in all domains does seem real. >> [...] >> So if we can safely and reliably disable DMA and MSI at the generic PCI >> device level during boot up I am all for it. >> >> >> How difficult would it be to tell the IOMMUs to stop passing traffic >> through in an early pci quirk? The problem setup was apparently someone >> using the device directly from a VM. So I presume there is an IOMMU >> in that configuration. > > This is a good idea I think - we considered something like that in > theory while
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
First of all, thanks everybody for the great insights/discussion! This thread ended-up being a great learning for (at least) me. Given the big amount of replies and intermixed comments, I wouldn't be able to respond inline to all, so I'll try another approach below. >From Bjorn: "I think [0] proposes using early_quirks() to disable MSIs at boot-time. That doesn't seem like a robust solution because (a) the problem affects all arches but early_quirks() is x86-specific and (b) even on x86 early_quirks() only works for PCI segment 0 because it relies on the 0xCF8/0xCFC I/O ports." Ah. I wasn't aware of that limitation, I thought enhancing the early_quirks() search to go through all buses would fix that, thanks for the clarification! And again, worth to clarify that this is not a problem affecting all arches _in practice_ - PowerPC for example has the FW primitives allowing a powerful PCI controller (out-of-band) reset, preventing this kind of issue usually. [0] https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com >From Bjorn: "A crash_device_shutdown() could do something at the host bridge level if that's possible, or reset/disable bus mastering/disable MSI/etc on individual PCI devices if necessary." >From Lukas: "Guilherme's original patches from 2018 iterate over all 256 PCI buses. That might impact boot time negatively. The reason he has to do that is because the crashing kernel doesn't know which devices exist and which have interrupts enabled. However the crashing kernel has that information. It should either disable interrupts itself or pass the necessary information to the crashing kernel as setup_data or whatever. Guilherme's patches add a "clearmsi" command line parameter. I guess the idea is that the parameter is always passed to the crash kernel but the patches don't do that, which seems odd." Very good points Lukas, thanks! The reason of not adding the clearmsi thing as a default kdump procedure is kinda related to your first point: it impacts a bit boot-time, also it's an additional logic in the kdump kernel, which we know is (sometimes) the last resort in gathering additional data to debug a potentially complex issue. That said, I'd not like to enforce this kind of "policy" to everybody, so my implementation relies on having it as a parameter, and the kdump userspace counter-part could then have a choice in adding or not such mechanism to the kdump kernel parameter list. About passing the data to next kernel, this is very interesting, we could do something like that either through setup_data (as you said) or using a new proposal, the PKRAM thing [a]. This way we wouldn't need a crash_device_shutdown(), but instead when kernel is loading a crash kernel (through kexec -p) we can collect all the necessary information that'll be passed to the crash kernel (obviously that if we are collecting PCI topology information, we need callbacks in the PCI hotplug add/del path to update this information). [a] https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yzn...@oracle.com/ Below, inline reply to Eric's last message. On 15/11/2020 17:46, Eric W. Biederman wrote: >> [...] >> An MSI interrupt is a (DMA) write to the local APIC base address >> 0xfeex which has the target CPU and control bits encoded in the >> lower bits. The written data is the vector and more control bits. >> >> The only way to stop that is to shut it up at the PCI device level. > > Or to write to magic chipset registers that will stop transforming DMA > writes to 0xfeex into x86 interrupts. With an IOMMU I know x86 has > such registers (because the point of the IOMMU is to limit the problems > rogue devices can cause). Without an IOMMU I don't know if x86 has any > such registers. I remember that other platforms have an interrupt > controller that does provide some level of control. That x86 does not > is what makes this an x86 specific problem. > [...] > Looking at patch 3/3 what this patchset does is an early disable of > of the msi registers. Which is mostly reasonable. Especially as has > been pointed out the only location the x86 vector and x86 cpu can > be found is in the msi configuration registers. > > That also seems reasonable. But Bjorn's concern about not finding all > devices in all domains does seem real. > [...] > So if we can safely and reliably disable DMA and MSI at the generic PCI > device level during boot up I am all for it. > > > How difficult would it be to tell the IOMMUs to stop passing traffic > through in an early pci quirk? The problem setup was apparently someone > using the device directly from a VM. So I presume there is an IOMMU > in that configuration. This is a good idea I think - we considered something like that in theory while working the problem back in 2018, but given I'm even less expert in IOMMU that I already am in PCI, the option was to go with the PCI approach. And you are right, the original problem is a
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
Thomas Gleixner writes: > On Sun, Nov 15 2020 at 08:29, Eric W. Biederman wrote: >> ebied...@xmission.com (Eric W. Biederman) writes: >> For ordinary irqs you can have this with level triggered irqs >> and the kernel has code that will shutdown the irq at the ioapic >> level. Then the kernel continues by polling the irq source. >> >> I am still missing details but my first question is can our general >> solution to screaming level triggered irqs apply? > > No. > >> How can edge triggered MSI irqs be screaming? >> >> Is there something we can do in enabling the APICs or IOAPICs that >> would allow this to be handled better. My memory when we enable >> the APICs and IOAPICs we completely clear the APIC entries and so >> should be disabling sources. > > Yes, but MSI has nothing to do with APIC/IOAPIC Yes, sorry. It has been long enough that the details were paged out of my memory. >> Is the problem perhaps that we wind up using an APIC entry that was >> previously used for the MSI interrupt as something else when we >> reprogram them? Even with this why doesn't the generic code >> to stop screaming irqs apply here? > > Again. No. The problem cannot be solved at the APIC level. The APIC is > the receiving end of MSI and has absolutely no control over it. > > An MSI interrupt is a (DMA) write to the local APIC base address > 0xfeex which has the target CPU and control bits encoded in the > lower bits. The written data is the vector and more control bits. > > The only way to stop that is to shut it up at the PCI device level. Or to write to magic chipset registers that will stop transforming DMA writes to 0xfeex into x86 interrupts. With an IOMMU I know x86 has such registers (because the point of the IOMMU is to limit the problems rogue devices can cause). Without an IOMMU I don't know if x86 has any such registers. I remember that other platforms have an interrupt controller that does provide some level of control. That x86 does not is what makes this an x86 specific problem. The generic solution is to have the PCI code set bus master disables when it is enumerationg and initializing devices. Last time I was paying attention that was actually the policy of the pci layer and drivers that did not enable bus mastering were considered buggy. Looking at patch 3/3 what this patchset does is an early disable of of the msi registers. Which is mostly reasonable. Especially as has been pointed out the only location the x86 vector and x86 cpu can be found is in the msi configuration registers. That also seems reasonable. But Bjorn's concern about not finding all devices in all domains does seem real. There are a handful of devices where the Bus master disable bit doesn't disable bus mastering. I wonder if there are devices where MSI and MSIX disables don't fully work. It seems completely possible to have MSI or MSIX equivalent registers at a non-standard location as drivers must be loaded to handle them. So if we can safely and reliably disable DMA and MSI at the generic PCI device level during boot up I am all for it. How difficult would it be to tell the IOMMUs to stop passing traffic through in an early pci quirk? The problem setup was apparently someone using the device directly from a VM. So I presume there is an IOMMU in that configuration. > Unfortunately there is no way to tell the APIC "Mask vector X" and the > dump kernel does neither know which device it comes from nor does it > have enumerated PCI completely which would reset the device and shutup > the spew. Due to the interrupt storm it does not get that far. So the question is how do we make this robust? Can we perhaps disable all interrupts in this case and limp along in polling mode until the pci bus has been enumerated? It is nice and desirable to be able to use the hardware in high performance mode in a kexec-on-panic situation but if we can detect a problem and figure out how to limp along sometimes that is acceptable. The failure mode in the kexec-on-panic kernel is definitely the corect one. We could not figure out how to wrestle the hardware into usability so we fail to take a crash dump or do anything else that might corrupt the system. Eric
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On Sun, Nov 15 2020 at 18:01, Lukas Wunner wrote: > On Sun, Nov 15, 2020 at 04:11:43PM +0100, Thomas Gleixner wrote: >> Unfortunately there is no way to tell the APIC "Mask vector X" and the >> dump kernel does neither know which device it comes from nor does it >> have enumerated PCI completely which would reset the device and shutup >> the spew. Due to the interrupt storm it does not get that far. > > Can't we just set DisINTx, clear MSI Enable and clear MSI-X Enable > on all active PCI devices in the crashing kernel before starting the > crash kernel? So that the crash kernel starts with a clean slate? > > Guilherme's original patches from 2018 iterate over all 256 PCI buses. > That might impact boot time negatively. The reason he has to do that > is because the crashing kernel doesn't know which devices exist and > which have interrupts enabled. However the crashing kernel has that > information. It should either disable interrupts itself or pass the > necessary information to the crashing kernel as setup_data or whatever. As I explained before: The problem with doing anything between crashing and starting the crash kernel is reducing the chance of actually starting it. The kernel crashed for whatever reason, so it's in a completely undefined state. Ergo there is no 'just do something'. You really have to think hard about what can be done safely with the least probability of running into another problem. Thanks, tglx
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On Sun, Nov 15, 2020 at 04:11:43PM +0100, Thomas Gleixner wrote: > Unfortunately there is no way to tell the APIC "Mask vector X" and the > dump kernel does neither know which device it comes from nor does it > have enumerated PCI completely which would reset the device and shutup > the spew. Due to the interrupt storm it does not get that far. Can't we just set DisINTx, clear MSI Enable and clear MSI-X Enable on all active PCI devices in the crashing kernel before starting the crash kernel? So that the crash kernel starts with a clean slate? Guilherme's original patches from 2018 iterate over all 256 PCI buses. That might impact boot time negatively. The reason he has to do that is because the crashing kernel doesn't know which devices exist and which have interrupts enabled. However the crashing kernel has that information. It should either disable interrupts itself or pass the necessary information to the crashing kernel as setup_data or whatever. Guilherme's patches add a "clearmsi" command line parameter. I guess the idea is that the parameter is always passed to the crash kernel but the patches don't do that, which seems odd. Thanks, Lukas
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On Sun, Nov 15 2020 at 08:29, Eric W. Biederman wrote: > ebied...@xmission.com (Eric W. Biederman) writes: > For ordinary irqs you can have this with level triggered irqs > and the kernel has code that will shutdown the irq at the ioapic > level. Then the kernel continues by polling the irq source. > > I am still missing details but my first question is can our general > solution to screaming level triggered irqs apply? No. > How can edge triggered MSI irqs be screaming? > > Is there something we can do in enabling the APICs or IOAPICs that > would allow this to be handled better. My memory when we enable > the APICs and IOAPICs we completely clear the APIC entries and so > should be disabling sources. Yes, but MSI has nothing to do with APIC/IOAPIC > Is the problem perhaps that we wind up using an APIC entry that was > previously used for the MSI interrupt as something else when we > reprogram them? Even with this why doesn't the generic code > to stop screaming irqs apply here? Again. No. The problem cannot be solved at the APIC level. The APIC is the receiving end of MSI and has absolutely no control over it. An MSI interrupt is a (DMA) write to the local APIC base address 0xfeex which has the target CPU and control bits encoded in the lower bits. The written data is the vector and more control bits. The only way to stop that is to shut it up at the PCI device level. Assume the following situation: - PCI device has MSI enabled and a valid target vector assigned - Kernel crashes - Kdump kernel starts - PCI device raises interrupts which result in the MSI write - Kdump kernel enables interrupts and the pending vector is raised in the CPU. - The CPU has no interrupt descriptor assigned to the vector and does not even know where the interrupt originated from. So it treats it like any other spurious interrupt to an unassigned vector, emits a ratelimited message and ACKs the interrupt at the APIC. - PCI device behaves stupid and reraises the interrupt for whatever reason. - Lather, rinse and repeat. Unfortunately there is no way to tell the APIC "Mask vector X" and the dump kernel does neither know which device it comes from nor does it have enumerated PCI completely which would reset the device and shutup the spew. Due to the interrupt storm it does not get that far. Thanks, tglx
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
ebied...@xmission.com (Eric W. Biederman) writes: > Bjorn Helgaas writes: > >> [+cc Rafael for question about ACPI method for PCI host bridge reset] >> >> On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote: >>> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote: >>> > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote: >>> >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote: >>> >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote: >>> >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're >>> >> >> doing a kexec and the device is in D0-D3hot, which should also disable >>> >> >> MSI/MSI-X. Why doesn't this solve the problem? Is this because the >>> >> >> device causing the storm was in PCI_UNKNOWN state? >>> >> > >>> >> > That's indeed a really good question. >>> >> >>> >> So we do that on kexec, but is that true when starting a kdump kernel >>> >> from a kernel crash? I doubt it. >>> > >>> > Ah, right, I bet that's it, thanks. The kdump path is basically this: >>> > >>> > crash_kexec >>> > machine_kexec >>> > >>> > while the usual kexec path is: >>> > >>> > kernel_kexec >>> > kernel_restart_prepare >>> > device_shutdown >>> > while (!list_empty(_kset->list)) >>> > dev->bus->shutdown >>> > pci_device_shutdown# pci_bus_type.shutdown >>> > machine_kexec >>> > >>> > So maybe we need to explore doing some or all of device_shutdown() in >>> > the crash_kexec() path as well as in the kernel_kexec() path. >>> >>> The problem is that if the machine crashed anything you try to attempt >>> before starting the crash kernel is reducing the chance that the crash >>> kernel actually starts. >> >> Right. >> >>> Is there something at the root bridge level which allows to tell the >>> underlying busses to shut up, reset or go into a defined state? That >>> might avoid chasing lists which might be already unreliable. >> >> Maybe we need some kind of crash_device_shutdown() that does the >> minimal thing to protect the kdump kernel from devices. > > The kdump kernel does not use any memory the original kernel uses. > Which should be a minimal and fairly robust level of protection > until the device drivers can be loaded and get ahold of things. > >> The programming model for conventional PCI host bridges and PCIe Root >> Complexes is device-specific since they're outside the PCI domain. >> There probably *are* ways to do those things, but you would need a >> native host bridge driver or something like an ACPI method. I'm not >> aware of an ACPI way to do this, but I added Rafael in case he is. >> >> A crash_device_shutdown() could do something at the host bridge level >> if that's possible, or reset/disable bus mastering/disable MSI/etc on >> individual PCI devices if necessary. > > Unless I am confused DMA'ing to memory that is not already in use > is completely broken wether or not you are using the kdump kernel. Bah. I was confused because I had not read up-thread. MSI mixes DMA and irqs so confusion is easy. So the problem is screaming irqs when the kernel is booting up. This is a fundamentally tricky problem. For ordinary irqs you can have this with level triggered irqs and the kernel has code that will shutdown the irq at the ioapic level. Then the kernel continues by polling the irq source. I am still missing details but my first question is can our general solution to screaming level triggered irqs apply? How can edge triggered MSI irqs be screaming? Is there something we can do in enabling the APICs or IOAPICs that would allow this to be handled better. My memory when we enable the APICs and IOAPICs we completely clear the APIC entries and so should be disabling sources. Is the problem perhaps that we wind up using an APIC entry that was previously used for the MSI interrupt as something else when we reprogram them? Even with this why doesn't the generic code to stop screaming irqs apply here? Eric
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
Bjorn Helgaas writes: > [+cc Rafael for question about ACPI method for PCI host bridge reset] > > On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote: >> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote: >> > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote: >> >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote: >> >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote: >> >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're >> >> >> doing a kexec and the device is in D0-D3hot, which should also disable >> >> >> MSI/MSI-X. Why doesn't this solve the problem? Is this because the >> >> >> device causing the storm was in PCI_UNKNOWN state? >> >> > >> >> > That's indeed a really good question. >> >> >> >> So we do that on kexec, but is that true when starting a kdump kernel >> >> from a kernel crash? I doubt it. >> > >> > Ah, right, I bet that's it, thanks. The kdump path is basically this: >> > >> > crash_kexec >> > machine_kexec >> > >> > while the usual kexec path is: >> > >> > kernel_kexec >> > kernel_restart_prepare >> > device_shutdown >> > while (!list_empty(_kset->list)) >> > dev->bus->shutdown >> > pci_device_shutdown# pci_bus_type.shutdown >> > machine_kexec >> > >> > So maybe we need to explore doing some or all of device_shutdown() in >> > the crash_kexec() path as well as in the kernel_kexec() path. >> >> The problem is that if the machine crashed anything you try to attempt >> before starting the crash kernel is reducing the chance that the crash >> kernel actually starts. > > Right. > >> Is there something at the root bridge level which allows to tell the >> underlying busses to shut up, reset or go into a defined state? That >> might avoid chasing lists which might be already unreliable. > > Maybe we need some kind of crash_device_shutdown() that does the > minimal thing to protect the kdump kernel from devices. The kdump kernel does not use any memory the original kernel uses. Which should be a minimal and fairly robust level of protection until the device drivers can be loaded and get ahold of things. > The programming model for conventional PCI host bridges and PCIe Root > Complexes is device-specific since they're outside the PCI domain. > There probably *are* ways to do those things, but you would need a > native host bridge driver or something like an ACPI method. I'm not > aware of an ACPI way to do this, but I added Rafael in case he is. > > A crash_device_shutdown() could do something at the host bridge level > if that's possible, or reset/disable bus mastering/disable MSI/etc on > individual PCI devices if necessary. Unless I am confused DMA'ing to memory that is not already in use is completely broken wether or not you are using the kdump kernel. Eric
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
[+cc Rafael for question about ACPI method for PCI host bridge reset] On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote: > On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote: > > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote: > >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote: > >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote: > >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're > >> >> doing a kexec and the device is in D0-D3hot, which should also disable > >> >> MSI/MSI-X. Why doesn't this solve the problem? Is this because the > >> >> device causing the storm was in PCI_UNKNOWN state? > >> > > >> > That's indeed a really good question. > >> > >> So we do that on kexec, but is that true when starting a kdump kernel > >> from a kernel crash? I doubt it. > > > > Ah, right, I bet that's it, thanks. The kdump path is basically this: > > > > crash_kexec > > machine_kexec > > > > while the usual kexec path is: > > > > kernel_kexec > > kernel_restart_prepare > > device_shutdown > > while (!list_empty(_kset->list)) > > dev->bus->shutdown > > pci_device_shutdown# pci_bus_type.shutdown > > machine_kexec > > > > So maybe we need to explore doing some or all of device_shutdown() in > > the crash_kexec() path as well as in the kernel_kexec() path. > > The problem is that if the machine crashed anything you try to attempt > before starting the crash kernel is reducing the chance that the crash > kernel actually starts. Right. > Is there something at the root bridge level which allows to tell the > underlying busses to shut up, reset or go into a defined state? That > might avoid chasing lists which might be already unreliable. Maybe we need some kind of crash_device_shutdown() that does the minimal thing to protect the kdump kernel from devices. The programming model for conventional PCI host bridges and PCIe Root Complexes is device-specific since they're outside the PCI domain. There probably *are* ways to do those things, but you would need a native host bridge driver or something like an ACPI method. I'm not aware of an ACPI way to do this, but I added Rafael in case he is. A crash_device_shutdown() could do something at the host bridge level if that's possible, or reset/disable bus mastering/disable MSI/etc on individual PCI devices if necessary. Bjorn
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
Bjorn, On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote: > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote: >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote: >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote: >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're >> >> doing a kexec and the device is in D0-D3hot, which should also disable >> >> MSI/MSI-X. Why doesn't this solve the problem? Is this because the >> >> device causing the storm was in PCI_UNKNOWN state? >> > >> > That's indeed a really good question. >> >> So we do that on kexec, but is that true when starting a kdump kernel >> from a kernel crash? I doubt it. > > Ah, right, I bet that's it, thanks. The kdump path is basically this: > > crash_kexec > machine_kexec > > while the usual kexec path is: > > kernel_kexec > kernel_restart_prepare > device_shutdown > while (!list_empty(_kset->list)) > dev->bus->shutdown > pci_device_shutdown# pci_bus_type.shutdown > machine_kexec > > So maybe we need to explore doing some or all of device_shutdown() in > the crash_kexec() path as well as in the kernel_kexec() path. The problem is that if the machine crashed anything you try to attempt before starting the crash kernel is reducing the chance that the crash kernel actually starts. Is there something at the root bridge level which allows to tell the underlying busses to shut up, reset or go into a defined state? That might avoid chasing lists which might be already unreliable. Thanks, tglx
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote: > Bjorn, > > On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote: > > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote: > >> pci_device_shutdown() still clears the Bus Master Enable bit if we're > >> doing a kexec and the device is in D0-D3hot, which should also disable > >> MSI/MSI-X. Why doesn't this solve the problem? Is this because the > >> device causing the storm was in PCI_UNKNOWN state? > > > > That's indeed a really good question. > > So we do that on kexec, but is that true when starting a kdump kernel > from a kernel crash? I doubt it. Ah, right, I bet that's it, thanks. The kdump path is basically this: crash_kexec machine_kexec while the usual kexec path is: kernel_kexec kernel_restart_prepare device_shutdown while (!list_empty(_kset->list)) dev->bus->shutdown pci_device_shutdown# pci_bus_type.shutdown machine_kexec So maybe we need to explore doing some or all of device_shutdown() in the crash_kexec() path as well as in the kernel_kexec() path.
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
Bjorn, On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote: > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote: >> pci_device_shutdown() still clears the Bus Master Enable bit if we're >> doing a kexec and the device is in D0-D3hot, which should also disable >> MSI/MSI-X. Why doesn't this solve the problem? Is this because the >> device causing the storm was in PCI_UNKNOWN state? > > That's indeed a really good question. So we do that on kexec, but is that true when starting a kdump kernel from a kernel crash? I doubt it. Thanks, tglx
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
Bjorn, On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote: > On Fri, Nov 06, 2020 at 10:14:14AM -0300, Guilherme G. Piccoli wrote: >> On 23/10/2018 14:03, Bjorn Helgaas wrote: > I guess Thomas' patch [2] (from thread [1]) doesn't solve this > problem? No. As I explained in [1] patch from [2] cannot solve it because the patch from [2] which is what Liu was trying to solve requires that there is a registered interrupt handler which knows how to shut up the interrupt. > I think [0] proposes using early_quirks() to disable MSIs at > boot-time. That doesn't seem like a robust solution because (a) the > problem affects all arches but early_quirks() is x86-specific and (b) > even on x86 early_quirks() only works for PCI segment 0 because it > relies on the 0xCF8/0xCFC I/O ports. > > If I understand Thomas' email correctly, the IRQ storm occurs here: > > start_kernel > setup_arch > early_quirks # x86-only > ... > read_pci_config_16(num, slot, func, PCI_VENDOR_ID) > outl(..., 0xcf8) # PCI segment 0 only > inw(0xcfc) > local_irq_enable > ... > native_irq_enable > asm("sti") # <-- enable IRQ, storm occurs > > native_irq_enable() happens long before we discover PCI host bridges > and run the normal PCI quirks, so those would be too late to disable > MSIs. Correct. > It doesn't seem practical to disable MSIs in the kdump kernel at the > PCI level. I was hoping we could disable them somewhere in the IRQ > code, e.g., at IOAPICs, but I think Thomas is saying that's not > feasible. MSIs are not even going near the IOAPIC and as long as the interrupt core does not have an interrupt set up for the device is has no idea where to look at to shut it down. Actually it does not even reach the interrupt core. The raised vector arrives at the CPU and the low level code sees: No handler associated, ignore it. We cannot do anything from the low level code because all we know is that the vector was raised, but we have absolutely zero clue where that came from. At that point the IO-APIC interrupts are definitely not the problem because they are all disabled. > It seems like the only option left is to disable MSIs before the > kexec. We used to clear the MSI/MSI-X Enable bits in > pci_device_shutdown(), but that broke console devices that relied on > MSI and caused "nobody cared" warnings when the devices fell back to > using INTx, so fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in > pci_device_shutdown()") left them unchanged. That might be solvable because INTx arrives at the IO-APIC and we could mask all the INTx related IO-APIC lines, but that's icky because of this: > pci_device_shutdown() still clears the Bus Master Enable bit if we're > doing a kexec and the device is in D0-D3hot, which should also disable > MSI/MSI-X. Why doesn't this solve the problem? Is this because the > device causing the storm was in PCI_UNKNOWN state? That's indeed a really good question. Thanks, tglx
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On Fri, Nov 06, 2020 at 10:14:14AM -0300, Guilherme G. Piccoli wrote: > On 23/10/2018 14:03, Bjorn Helgaas wrote: > > On Mon, Oct 22, 2018 at 05:35:06PM -0300, Guilherme G. Piccoli wrote: > >> On 18/10/2018 19:15, Bjorn Helgaas wrote: > >>> On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote: > >>> [...] > >> I understand your point, but I think this is inherently an architecture > >> problem. No matter what solution we decide for, it'll need to be applied > >> in early boot time, like before the PCI layer gets initialized. > > > > This is the part I want to know more about. Apparently there's some > > event X between early_quirks() and the PCI device enumeration, and we > > must disable MSIs before X: > > > > setup_arch() > > early_quirks() # arch/x86/kernel/early-quirks.c > > early_pci_clear_msi() > > ... > > X > > ... > > pci_scan_root_bus_bridge() > > ... > > DECLARE_PCI_FIXUP_EARLY # drivers/pci/quirks.c > > > > I want to know specifically what X is. If we don't know what X is and > > all we know is "we have to disable MSIs earlier than PCI init", then > > we're likely to break things again in the future by changing the order > > of disabling MSIs and whatever X is. > > Hi Bjorn (and all CCed), I'm sorry to necro-bump a thread >2 years > later, but recent discussions led to a better understanding of this 'X' > point, thanks to Thomas! > > For those that deleted this thread from their email clients, it's > available in [0] - the summary is that we faced an IRQ storm really > early in boot, due to a bogus PCIe device MSI behavior, when booting a > kdump kernel. This led the machine to get stuck in the boot and we > couldn't kdump. The solution hereby proposed is to clear MSI interrupts > early in x86, if a parameter is provided. I don't have the reproducer > anymore and it was pretty hard to reproduce in virtual environments. > > So, about the 'X' Bjorn, in another recent thread about IRQ storms [1], > Thomas clarified that and after a brief discussion, it seems there's no > better way to prevent the MSI storm other than clearing the MSI > capability early in boot. As discussed both here and in thread [1], this > is indeed a per-architecture issue (powerpc is not subject to that, due > to a better FW reset mechanism), so I think we still could benefit in > having this idea implemented upstream, at least in x86 (we could expand > to other architectures if desired, in the future). > > As a "test" data point, this was implemented in Ubuntu (same 3 patches > present in this series) for ~2 years and we haven't received bug reports > - I'm saying that because I understand your concerns about expanding the > early PCI quirks scope. > > Let me know your thoughts. I'd suggest all to read thread [1], which > addresses a similar issue but in a different "moment" of the system boot > and provides some more insight on why the early MSI clearing seems to > make sense. I guess Thomas' patch [2] (from thread [1]) doesn't solve this problem? I think [0] proposes using early_quirks() to disable MSIs at boot-time. That doesn't seem like a robust solution because (a) the problem affects all arches but early_quirks() is x86-specific and (b) even on x86 early_quirks() only works for PCI segment 0 because it relies on the 0xCF8/0xCFC I/O ports. If I understand Thomas' email correctly, the IRQ storm occurs here: start_kernel setup_arch early_quirks # x86-only ... read_pci_config_16(num, slot, func, PCI_VENDOR_ID) outl(..., 0xcf8) # PCI segment 0 only inw(0xcfc) local_irq_enable ... native_irq_enable asm("sti") # <-- enable IRQ, storm occurs native_irq_enable() happens long before we discover PCI host bridges and run the normal PCI quirks, so those would be too late to disable MSIs. It doesn't seem practical to disable MSIs in the kdump kernel at the PCI level. I was hoping we could disable them somewhere in the IRQ code, e.g., at IOAPICs, but I think Thomas is saying that's not feasible. It seems like the only option left is to disable MSIs before the kexec. We used to clear the MSI/MSI-X Enable bits in pci_device_shutdown(), but that broke console devices that relied on MSI and caused "nobody cared" warnings when the devices fell back to using INTx, so fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()") left them unchanged. pci_device_shutdown() still clears the Bus Master Enable bit if we're doing a kexec and the device is in D0-D3hot, which should also disable MSI/MSI-X. Why doesn't this solve the problem? Is this because the device causing the storm was in PCI_UNKNOWN state? > [0] > https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com > > [1] https://lore.kernel.org/lkml/87y2js3ghv@nanos.tec.linutronix.de [2]
Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
On 23/10/2018 14:03, Bjorn Helgaas wrote: > On Mon, Oct 22, 2018 at 05:35:06PM -0300, Guilherme G. Piccoli wrote: >> On 18/10/2018 19:15, Bjorn Helgaas wrote: >>> On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote: >>> [...] >> I understand your point, but I think this is inherently an architecture >> problem. No matter what solution we decide for, it'll need to be applied >> in early boot time, like before the PCI layer gets initialized. > > This is the part I want to know more about. Apparently there's some > event X between early_quirks() and the PCI device enumeration, and we > must disable MSIs before X: > > setup_arch() > early_quirks() # arch/x86/kernel/early-quirks.c > early_pci_clear_msi() > ... > X > ... > pci_scan_root_bus_bridge() > ... > DECLARE_PCI_FIXUP_EARLY # drivers/pci/quirks.c > > I want to know specifically what X is. If we don't know what X is and > all we know is "we have to disable MSIs earlier than PCI init", then > we're likely to break things again in the future by changing the order > of disabling MSIs and whatever X is. > > Bjorn > Hi Bjorn (and all CCed), I'm sorry to necro-bump a thread >2 years later, but recent discussions led to a better understanding of this 'X' point, thanks to Thomas! For those that deleted this thread from their email clients, it's available in [0] - the summary is that we faced an IRQ storm really early in boot, due to a bogus PCIe device MSI behavior, when booting a kdump kernel. This led the machine to get stuck in the boot and we couldn't kdump. The solution hereby proposed is to clear MSI interrupts early in x86, if a parameter is provided. I don't have the reproducer anymore and it was pretty hard to reproduce in virtual environments. So, about the 'X' Bjorn, in another recent thread about IRQ storms [1], Thomas clarified that and after a brief discussion, it seems there's no better way to prevent the MSI storm other than clearing the MSI capability early in boot. As discussed both here and in thread [1], this is indeed a per-architecture issue (powerpc is not subject to that, due to a better FW reset mechanism), so I think we still could benefit in having this idea implemented upstream, at least in x86 (we could expand to other architectures if desired, in the future). As a "test" data point, this was implemented in Ubuntu (same 3 patches present in this series) for ~2 years and we haven't received bug reports - I'm saying that because I understand your concerns about expanding the early PCI quirks scope. Let me know your thoughts. I'd suggest all to read thread [1], which addresses a similar issue but in a different "moment" of the system boot and provides some more insight on why the early MSI clearing seems to make sense. Thanks, Guilherme [0] https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com [1] https://lore.kernel.org/lkml/87y2js3ghv@nanos.tec.linutronix.de