Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-12-14 Thread Guilherme Piccoli
Thank you for the clarification Bjorn! I was on vacation..sorry for my delay.

Closing the loop here, I understand we're not getting this patch
merged (due to its restriction to domain 0) and there was a suggestion
in the thread of trying to block MSIs from the IOMMU init code (which
also have the restriction of only working in iommu-enabled systems).
Hope the thread is helpful and if somebody faces this issue, can
comment here and at least find this approach, maybe test the patch.

Thanks to all involved!


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-30 Thread Bjorn Helgaas
On Wed, Nov 18, 2020 at 07:36:08PM -0300, Guilherme Piccoli wrote:
> Thanks a lot Bjorn! I confess except for PPC64 Server machines, I
> never saw other "domains" or segments. Is it common in x86 to have
> that? The early_quirks() are restricted to the first segment, no
> matter how many host bridges we have in segment ?

I don't know whether it's *common* to have multiple domains on x86,
but they're definitely used on large systems.  This includes some
lspci info from an HPE Superdome Flex system:
https://lore.kernel.org/lkml/5350e02a-6457-41a8-8f33-af67bfdae...@fb.com/

The early quirks in arch/x86/kernel/early-quirks.c (not the
DECLARE_PCI_FIXUP_EARLY quirks in drivers/pci/quirks.c) are restricted
to segment 0, no matter how many host bridges there are.  This is
because they use read_pci_config_16(), which uses a config access
mechanism that has no provision for a segment number.

Bjorn


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-18 Thread Guilherme Piccoli
Thanks a lot Bjorn! I confess except for PPC64 Server machines, I
never saw other "domains" or segments. Is it common in x86 to have
that? The early_quirks() are restricted to the first segment, no
matter how many host bridges we have in segment ?

Thanks again!


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-18 Thread Bjorn Helgaas
On Tue, Nov 17, 2020 at 09:04:07AM -0300, Guilherme Piccoli wrote:

> Also, taking here the opportunity to clarify my understanding about
> the limitations of that approach: Bjorn, in our reproducer machine we
> had 3 parents in the PCI tree (as per lspci -t), :00, :ff and
> :80 - are those all under "segment 0" as per your verbiage?

Yes.  The "" is the PCI segment (or "domain" in the Linux PCI core).
It's common on x86 to have multiple host bridges in segment .


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-17 Thread Eric W. Biederman
Thomas Gleixner  writes:

> On Tue, Nov 17 2020 at 12:19, David Woodhouse wrote:
>> On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote:
>>> But that does not solve the problem either simply because then the IOMMU
>>> will catch the rogue MSIs and you get an interrupt storm on the IOMMU
>>> error interrupt.
>>
>> Not if you can tell the IOMMU to stop reporting those errors.
>>
>> We can easily do it per-device for DMA errors; not quite sure what
>> granularity we have for interrupts. Perhaps the Intel IOMMU only lets
>> you set the Fault Processing Disable bit per IRTE entry, and you still
>> get faults for Compatibility Format interrupts? Not sure about AMD...
>
> It looks like the fault (DMAR) and event (AMD) interrupts can be
> disabled in the IOMMU. That might help to bridge the gap until the PCI
> bus is scanned in full glory and the devices can be shut up for real.
>
> If we make this conditional for a crash dump kernel that might do the
> trick.
>
> Lot's of _might_ there :)

Worth testing.

Folks tracking this down is this enough of a hint for you to write a
patch and test?

Also worth checking how close irqpoll is to handling a case like this.
At least historically it did a pretty good job at shutting down problem
interrupts.

I really find it weird that an edge triggered irq was firing fast enough
to stop a system from booting.  Level triggered irqs do that if they are
acknolwedged without actually being handled.  I think edge triggered
irqs only fire when another event comes in, and it seems odd to get so
many actual events causing interrupts that the system soft locks
up.  Is my memory of that situation confused?

I recommend making these facilities general debug facilities so that
they can be used for cases other than crash dump.  The crash dump kernel
would always enable them because it can assume that the hardware is
very likely in a wonky state.

Eric



Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-17 Thread Thomas Gleixner
On Tue, Nov 17 2020 at 12:19, David Woodhouse wrote:
> On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote:
>> But that does not solve the problem either simply because then the IOMMU
>> will catch the rogue MSIs and you get an interrupt storm on the IOMMU
>> error interrupt.
>
> Not if you can tell the IOMMU to stop reporting those errors.
>
> We can easily do it per-device for DMA errors; not quite sure what
> granularity we have for interrupts. Perhaps the Intel IOMMU only lets
> you set the Fault Processing Disable bit per IRTE entry, and you still
> get faults for Compatibility Format interrupts? Not sure about AMD...

It looks like the fault (DMAR) and event (AMD) interrupts can be
disabled in the IOMMU. That might help to bridge the gap until the PCI
bus is scanned in full glory and the devices can be shut up for real.

If we make this conditional for a crash dump kernel that might do the
trick.

Lot's of _might_ there :)

Thanks

tglx







Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-17 Thread David Woodhouse
On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote:
> But that does not solve the problem either simply because then the IOMMU
> will catch the rogue MSIs and you get an interrupt storm on the IOMMU
> error interrupt.

Not if you can tell the IOMMU to stop reporting those errors.

We can easily do it per-device for DMA errors; not quite sure what
granularity we have for interrupts. Perhaps the Intel IOMMU only lets
you set the Fault Processing Disable bit per IRTE entry, and you still
get faults for Compatibility Format interrupts? Not sure about AMD...


smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-17 Thread Guilherme Piccoli
On Mon, Nov 16, 2020 at 10:07 PM Eric W. Biederman
 wrote:
> [...]
> > I think we need to disable MSIs in the crashing kernel before the
> > kexec.  It adds a little more code in the crash_kexec() path, but it
> > seems like a worthwhile tradeoff.
>
> Disabling MSIs in the b0rken kernel is not possible.
>
> Walking the device tree or even a significant subset of it hugely
> decreases the chances that we will run into something that is incorrect
> in the known broken kernel.  I expect the code to do that would double
> or triple the amount of code that must be executed in the known broken
> kernel.  The last time something like that happened (switching from xchg
> to ordinary locks) we had cases that stopped working.  Walking all of
> the pci devices in the system is much more invasive.
>

I think we could try to decouple this problem in 2, if that makes
sense. Bjorn/others can jump in and correct me if I'm wrong. So, the
problem is to walk a PCI topology tree, identify every device and
disable MSI(/INTx maybe) in them, and the big deal with doing that in
the broken kernel before the kexec is that this task is complex due to
the tree walk, mainly. But what if we keep a table (a simple 2-D
array) with the address and data to be written to disable the MSIs,
and before the kexec we could have a parameter enabling a function
that just goes through this array and performs the writes?

The table itself would be constructed by the PCI core (and updated in
the hotplug path), when device discovery is happening. This table
would live in a protected area in memory, with no write/execute
access, this way if the kernel itself tries to corrupt that, we get a
fault (yeah, I know DMAs can corrupt anywhere, but IOMMU could protect
against that). If the parameter "kdump_clear_msi" is passed in the
cmdline of the regular kernel, for example, then the function walks
this simple table and performs the devices' writes before the kexec...

If that's not possible or a bad idea for any reason, I still think the
early_quirks() idea hereby proposed is not something we should
discard, because it'll help a lot of users even with its limitations
(in our case it worked very well).
Also, taking here the opportunity to clarify my understanding about
the limitations of that approach: Bjorn, in our reproducer machine we
had 3 parents in the PCI tree (as per lspci -t), :00, :ff and
:80 - are those all under "segment 0" as per your verbiage? In our
case the troublemaker was under :80, and the early_quirks() shut
the device up successfully.

Thanks,


Guilherme


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-17 Thread Thomas Gleixner
On Mon, Nov 16 2020 at 19:06, Eric W. Biederman wrote:
> Bjorn Helgaas  writes:
> My two top candidates are poking the IOMMUs early to shut things off,
> and figuring out if we can delay enabling interrupts until we have
> initialized pci.

Keeping interrupts disabled post PCI initialization would be nice, but
that requires feeding the whole init machinery through a shredder and
collecting the bits and pieces.

> Poking at IOMMUs early should work for most systems with ``enterprise''
> hardware.  Systems where people care about kdump the most.

The IOMMU IRQ remapping part _is_ initialized early and _before_
interrupts are enabled.

But that does not solve the problem either simply because then the IOMMU
will catch the rogue MSIs and you get an interrupt storm on the IOMMU
error interrupt.

This one is not going to be turned off because the IOMMU error interrupt
handler will handle each of them and tell the core code that everything
is under control.

As I explained several times now, the only way to shut this up reliably
is at the source. Curing the symptom is almost never a good solution.

Thanks,

tglx


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-16 Thread Eric W. Biederman
Bjorn Helgaas  writes:

> I don't think passing the device information to the kdump kernel is
> really practical.  The kdump kernel would use it to do PCI config
> writes to disable MSIs before enabling IRQs, and it doesn't know how
> to access config space that early.

I don't think it is particularly practical either.  But in practice
on x86 it is either mmio writes or 0xcf8 style writes and we could
pass a magic table that would have all of that information.

> We could invent special "early config access" things, but that gets
> really complicated really fast.  Config access depends on ACPI MCFG
> tables, firmware interfaces, and in many cases, on the native host
> bridge drivers in drivers/pci/controllers/.

I do agree that the practical problem with passing information early
is that gets us into the weeds and creates code that we only care
about in the case of kexec-on-panic.  It is much better to make the
existing code more robust, so that we reduce our dependency on firmware
doing the right thing.

> I think we need to disable MSIs in the crashing kernel before the
> kexec.  It adds a little more code in the crash_kexec() path, but it
> seems like a worthwhile tradeoff.

Disabling MSIs in the b0rken kernel is not possible.

Walking the device tree or even a significant subset of it hugely
decreases the chances that we will run into something that is incorrect
in the known broken kernel.  I expect the code to do that would double
or triple the amount of code that must be executed in the known broken
kernel.  The last time something like that happened (switching from xchg
to ordinary locks) we had cases that stopped working.  Walking all of
the pci devices in the system is much more invasive.

That is not to downplay the problems of figuring out how to disable
things in early boot.

My two top candidates are poking the IOMMUs early to shut things off,
and figuring out if we can delay enabling interrupts until we have
initialized pci.

Poking at IOMMUs early should work for most systems with ``enterprise''
hardware.  Systems where people care about kdump the most.

Eric


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-16 Thread Bjorn Helgaas
On Mon, Nov 16, 2020 at 05:31:36PM -0300, Guilherme G. Piccoli wrote:
> First of all, thanks everybody for the great insights/discussion! This
> thread ended-up being a great learning for (at least) me.
> 
> Given the big amount of replies and intermixed comments, I wouldn't be
> able to respond inline to all, so I'll try another approach below.
> 
> 
> From Bjorn:
> "I think [0] proposes using early_quirks() to disable MSIs at boot-time.
> That doesn't seem like a robust solution because (a) the problem affects
> all arches but early_quirks() is x86-specific and (b) even on x86
> early_quirks() only works for PCI segment 0 because it relies on the
> 0xCF8/0xCFC I/O ports."
> 
> Ah. I wasn't aware of that limitation, I thought enhancing the
> early_quirks() search to go through all buses would fix that, thanks for
> the clarification! And again, worth to clarify that this is not a
> problem affecting all arches _in practice_ - PowerPC for example has the
> FW primitives allowing a powerful PCI controller (out-of-band) reset,
> preventing this kind of issue usually.
> 
> [0]
> https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com
> 
> 
> From Bjorn:
> "A crash_device_shutdown() could do something at the host bridge level
> if that's possible, or reset/disable bus mastering/disable MSI/etc on
> individual PCI devices if necessary."
> 
> From Lukas:
> "Guilherme's original patches from 2018 iterate over all 256 PCI buses.
> That might impact boot time negatively.  The reason he has to do that is
> because the crashing kernel doesn't know which devices exist and which
> have interrupts enabled.  However the crashing kernel has that
> information.  It should either disable interrupts itself or pass the
> necessary information to the crashing kernel as setup_data or whatever.

I don't think passing the device information to the kdump kernel is
really practical.  The kdump kernel would use it to do PCI config
writes to disable MSIs before enabling IRQs, and it doesn't know how
to access config space that early.

We could invent special "early config access" things, but that gets
really complicated really fast.  Config access depends on ACPI MCFG
tables, firmware interfaces, and in many cases, on the native host
bridge drivers in drivers/pci/controllers/.

I think we need to disable MSIs in the crashing kernel before the
kexec.  It adds a little more code in the crash_kexec() path, but it
seems like a worthwhile tradeoff.

Bjorn


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-16 Thread Guilherme Piccoli
On Mon, Nov 16, 2020 at 6:45 PM Eric W. Biederman  wrote:
> The way to do that would be to collect the set of pci devices when the
> kexec on panic kernel is loaded, not during crash_kexec.  If someone
> performs a device hotplug they would need to reload the kexec on panic
> kernel.
>
> I am not necessarily endorsing that just pointing out how it can be
> done.
>
> Eric

Thanks Eric, I agree! I think if we use something like PKRAM (a more
dynamic approach) we could have the PCI hotplug path updating the data
to-be-passed to the crash kernel, so the crash kernel doesn't even
need to be loaded again.


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-16 Thread Eric W. Biederman
"Guilherme G. Piccoli"  writes:

> First of all, thanks everybody for the great insights/discussion! This
> thread ended-up being a great learning for (at least) me.
>
> Given the big amount of replies and intermixed comments, I wouldn't be
> able to respond inline to all, so I'll try another approach below.
>
>
> From Bjorn:
> "I think [0] proposes using early_quirks() to disable MSIs at boot-time.
> That doesn't seem like a robust solution because (a) the problem affects
> all arches but early_quirks() is x86-specific and (b) even on x86
> early_quirks() only works for PCI segment 0 because it relies on the
> 0xCF8/0xCFC I/O ports."
>
> Ah. I wasn't aware of that limitation, I thought enhancing the
> early_quirks() search to go through all buses would fix that, thanks for
> the clarification! And again, worth to clarify that this is not a
> problem affecting all arches _in practice_ - PowerPC for example has the
> FW primitives allowing a powerful PCI controller (out-of-band) reset,
> preventing this kind of issue usually.
>
> [0]
> https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com
>
>
> From Bjorn:
> "A crash_device_shutdown() could do something at the host bridge level
> if that's possible, or reset/disable bus mastering/disable MSI/etc on
> individual PCI devices if necessary."
>
> From Lukas:
> "Guilherme's original patches from 2018 iterate over all 256 PCI buses.
> That might impact boot time negatively.  The reason he has to do that is
> because the crashing kernel doesn't know which devices exist and which
> have interrupts enabled.  However the crashing kernel has that
> information.  It should either disable interrupts itself or pass the
> necessary information to the crashing kernel as setup_data or whatever.
>
> Guilherme's patches add a "clearmsi" command line parameter.  I guess
> the idea is that the parameter is always passed to the crash kernel but
> the patches don't do that, which seems odd."
>
> Very good points Lukas, thanks! The reason of not adding the clearmsi
> thing as a default kdump procedure is kinda related to your first point:
> it impacts a bit boot-time, also it's an additional logic in the kdump
> kernel, which we know is (sometimes) the last resort in gathering
> additional data to debug a potentially complex issue. That said, I'd not
> like to enforce this kind of "policy" to everybody, so my implementation
> relies on having it as a parameter, and the kdump userspace counter-part
> could then have a choice in adding or not such mechanism to the kdump
> kernel parameter list.
>
> About passing the data to next kernel, this is very interesting, we
> could do something like that either through setup_data (as you said) or
> using a new proposal, the PKRAM thing [a].
> This way we wouldn't need a crash_device_shutdown(), but instead when
> kernel is loading a crash kernel (through kexec -p) we can collect all
> the necessary information that'll be passed to the crash kernel
> (obviously that if we are collecting PCI topology information, we need
> callbacks in the PCI hotplug add/del path to update this information).
>
> [a]
> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yzn...@oracle.com/
>
> Below, inline reply to Eric's last message.
>
>
> On 15/11/2020 17:46, Eric W. Biederman wrote:
>>> [...]
>>> An MSI interrupt is a (DMA) write to the local APIC base address
>>> 0xfeex which has the target CPU and control bits encoded in the
>>> lower bits. The written data is the vector and more control bits.
>>>
>>> The only way to stop that is to shut it up at the PCI device level.
>> 
>> Or to write to magic chipset registers that will stop transforming DMA
>> writes to 0xfeex into x86 interrupts.  With an IOMMU I know x86 has
>> such registers (because the point of the IOMMU is to limit the problems
>> rogue devices can cause).  Without an IOMMU I don't know if x86 has any
>> such registers.  I remember that other platforms have an interrupt
>> controller that does provide some level of control.  That x86 does not
>> is what makes this an x86 specific problem.
>> [...]
>> Looking at patch 3/3 what this patchset does is an early disable of
>> of the msi registers.  Which is mostly reasonable.  Especially as has
>> been pointed out the only location the x86 vector and x86 cpu can
>> be found is in the msi configuration registers.
>> 
>> That also seems reasonable.  But Bjorn's concern about not finding all
>> devices in all domains does seem real.
>> [...]
>> So if we can safely and reliably disable DMA and MSI at the generic PCI
>> device level during boot up I am all for it.
>> 
>> 
>> How difficult would it be to tell the IOMMUs to stop passing traffic
>> through in an early pci quirk?  The problem setup was apparently someone
>> using the device directly from a VM.  So I presume there is an IOMMU
>> in that configuration.
>
> This is a good idea I think - we considered something like that in
> theory while 

Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-16 Thread Guilherme G. Piccoli
First of all, thanks everybody for the great insights/discussion! This
thread ended-up being a great learning for (at least) me.

Given the big amount of replies and intermixed comments, I wouldn't be
able to respond inline to all, so I'll try another approach below.


>From Bjorn:
"I think [0] proposes using early_quirks() to disable MSIs at boot-time.
That doesn't seem like a robust solution because (a) the problem affects
all arches but early_quirks() is x86-specific and (b) even on x86
early_quirks() only works for PCI segment 0 because it relies on the
0xCF8/0xCFC I/O ports."

Ah. I wasn't aware of that limitation, I thought enhancing the
early_quirks() search to go through all buses would fix that, thanks for
the clarification! And again, worth to clarify that this is not a
problem affecting all arches _in practice_ - PowerPC for example has the
FW primitives allowing a powerful PCI controller (out-of-band) reset,
preventing this kind of issue usually.

[0]
https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com


>From Bjorn:
"A crash_device_shutdown() could do something at the host bridge level
if that's possible, or reset/disable bus mastering/disable MSI/etc on
individual PCI devices if necessary."

>From Lukas:
"Guilherme's original patches from 2018 iterate over all 256 PCI buses.
That might impact boot time negatively.  The reason he has to do that is
because the crashing kernel doesn't know which devices exist and which
have interrupts enabled.  However the crashing kernel has that
information.  It should either disable interrupts itself or pass the
necessary information to the crashing kernel as setup_data or whatever.

Guilherme's patches add a "clearmsi" command line parameter.  I guess
the idea is that the parameter is always passed to the crash kernel but
the patches don't do that, which seems odd."

Very good points Lukas, thanks! The reason of not adding the clearmsi
thing as a default kdump procedure is kinda related to your first point:
it impacts a bit boot-time, also it's an additional logic in the kdump
kernel, which we know is (sometimes) the last resort in gathering
additional data to debug a potentially complex issue. That said, I'd not
like to enforce this kind of "policy" to everybody, so my implementation
relies on having it as a parameter, and the kdump userspace counter-part
could then have a choice in adding or not such mechanism to the kdump
kernel parameter list.

About passing the data to next kernel, this is very interesting, we
could do something like that either through setup_data (as you said) or
using a new proposal, the PKRAM thing [a].
This way we wouldn't need a crash_device_shutdown(), but instead when
kernel is loading a crash kernel (through kexec -p) we can collect all
the necessary information that'll be passed to the crash kernel
(obviously that if we are collecting PCI topology information, we need
callbacks in the PCI hotplug add/del path to update this information).

[a]
https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yzn...@oracle.com/

Below, inline reply to Eric's last message.


On 15/11/2020 17:46, Eric W. Biederman wrote:
>> [...]
>> An MSI interrupt is a (DMA) write to the local APIC base address
>> 0xfeex which has the target CPU and control bits encoded in the
>> lower bits. The written data is the vector and more control bits.
>>
>> The only way to stop that is to shut it up at the PCI device level.
> 
> Or to write to magic chipset registers that will stop transforming DMA
> writes to 0xfeex into x86 interrupts.  With an IOMMU I know x86 has
> such registers (because the point of the IOMMU is to limit the problems
> rogue devices can cause).  Without an IOMMU I don't know if x86 has any
> such registers.  I remember that other platforms have an interrupt
> controller that does provide some level of control.  That x86 does not
> is what makes this an x86 specific problem.
> [...]
> Looking at patch 3/3 what this patchset does is an early disable of
> of the msi registers.  Which is mostly reasonable.  Especially as has
> been pointed out the only location the x86 vector and x86 cpu can
> be found is in the msi configuration registers.
> 
> That also seems reasonable.  But Bjorn's concern about not finding all
> devices in all domains does seem real.
> [...]
> So if we can safely and reliably disable DMA and MSI at the generic PCI
> device level during boot up I am all for it.
> 
> 
> How difficult would it be to tell the IOMMUs to stop passing traffic
> through in an early pci quirk?  The problem setup was apparently someone
> using the device directly from a VM.  So I presume there is an IOMMU
> in that configuration.

This is a good idea I think - we considered something like that in
theory while working the problem back in 2018, but given I'm even less
expert in IOMMU that I already am in PCI, the option was to go with the
PCI approach. And you are right, the original problem is a 

Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-15 Thread Eric W. Biederman
Thomas Gleixner  writes:

> On Sun, Nov 15 2020 at 08:29, Eric W. Biederman wrote:
>> ebied...@xmission.com (Eric W. Biederman) writes:
>> For ordinary irqs you can have this with level triggered irqs
>> and the kernel has code that will shutdown the irq at the ioapic
>> level.  Then the kernel continues by polling the irq source.
>>
>> I am still missing details but my first question is can our general
>> solution to screaming level triggered irqs apply?
>
> No.
>
>> How can edge triggered MSI irqs be screaming?
>>
>> Is there something we can do in enabling the APICs or IOAPICs that
>> would allow this to be handled better.  My memory when we enable
>> the APICs and IOAPICs we completely clear the APIC entries and so
>> should be disabling sources.
>
> Yes, but MSI has nothing to do with APIC/IOAPIC

Yes, sorry.  It has been long enough that the details were paged out
of my memory.

>> Is the problem perhaps that we wind up using an APIC entry that was
>> previously used for the MSI interrupt as something else when we
>> reprogram them?  Even with this why doesn't the generic code
>> to stop screaming irqs apply here?
>
> Again. No. The problem cannot be solved at the APIC level. The APIC is
> the receiving end of MSI and has absolutely no control over it.
>
> An MSI interrupt is a (DMA) write to the local APIC base address
> 0xfeex which has the target CPU and control bits encoded in the
> lower bits. The written data is the vector and more control bits.
>
> The only way to stop that is to shut it up at the PCI device level.

Or to write to magic chipset registers that will stop transforming DMA
writes to 0xfeex into x86 interrupts.  With an IOMMU I know x86 has
such registers (because the point of the IOMMU is to limit the problems
rogue devices can cause).  Without an IOMMU I don't know if x86 has any
such registers.  I remember that other platforms have an interrupt
controller that does provide some level of control.  That x86 does not
is what makes this an x86 specific problem.

The generic solution is to have the PCI code set bus master disables
when it is enumerationg and initializing devices.  Last time I was
paying attention that was actually the policy of the pci layer and
drivers that did not enable bus mastering were considered buggy.

Looking at patch 3/3 what this patchset does is an early disable of
of the msi registers.  Which is mostly reasonable.  Especially as has
been pointed out the only location the x86 vector and x86 cpu can
be found is in the msi configuration registers.

That also seems reasonable.  But Bjorn's concern about not finding all
devices in all domains does seem real.

There are a handful of devices where the Bus master disable bit doesn't
disable bus mastering.  I wonder if there are devices where MSI and MSIX
disables don't fully work.  It seems completely possible to have MSI or
MSIX equivalent registers at a non-standard location as drivers must be
loaded to handle them.

So if we can safely and reliably disable DMA and MSI at the generic PCI
device level during boot up I am all for it.


How difficult would it be to tell the IOMMUs to stop passing traffic
through in an early pci quirk?  The problem setup was apparently someone
using the device directly from a VM.  So I presume there is an IOMMU
in that configuration.

> Unfortunately there is no way to tell the APIC "Mask vector X" and the
> dump kernel does neither know which device it comes from nor does it
> have enumerated PCI completely which would reset the device and shutup
> the spew. Due to the interrupt storm it does not get that far.

So the question is how do we make this robust?

Can we perhaps disable all interrupts in this case and limp along
in polling mode until the pci bus has been enumerated?

It is nice and desirable to be able to use the hardware in high
performance mode in a kexec-on-panic situation but if we can detect a
problem and figure out how to limp along sometimes that is acceptable.

The failure mode in the kexec-on-panic kernel is definitely the corect
one.  We could not figure out how to wrestle the hardware into usability
so we fail to take a crash dump or do anything else that might corrupt
the system.

Eric


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-15 Thread Thomas Gleixner
On Sun, Nov 15 2020 at 18:01, Lukas Wunner wrote:
> On Sun, Nov 15, 2020 at 04:11:43PM +0100, Thomas Gleixner wrote:
>> Unfortunately there is no way to tell the APIC "Mask vector X" and the
>> dump kernel does neither know which device it comes from nor does it
>> have enumerated PCI completely which would reset the device and shutup
>> the spew. Due to the interrupt storm it does not get that far.
>
> Can't we just set DisINTx, clear MSI Enable and clear MSI-X Enable
> on all active PCI devices in the crashing kernel before starting the
> crash kernel?  So that the crash kernel starts with a clean slate?
>
> Guilherme's original patches from 2018 iterate over all 256 PCI buses.
> That might impact boot time negatively.  The reason he has to do that
> is because the crashing kernel doesn't know which devices exist and
> which have interrupts enabled.  However the crashing kernel has that
> information.  It should either disable interrupts itself or pass the
> necessary information to the crashing kernel as setup_data or whatever.

As I explained before: The problem with doing anything between crashing
and starting the crash kernel is reducing the chance of actually
starting it. The kernel crashed for whatever reason, so it's in a
completely undefined state.

Ergo there is no 'just do something'. You really have to think hard
about what can be done safely with the least probability of running into
another problem.

Thanks,

tglx




Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-15 Thread Lukas Wunner
On Sun, Nov 15, 2020 at 04:11:43PM +0100, Thomas Gleixner wrote:
> Unfortunately there is no way to tell the APIC "Mask vector X" and the
> dump kernel does neither know which device it comes from nor does it
> have enumerated PCI completely which would reset the device and shutup
> the spew. Due to the interrupt storm it does not get that far.

Can't we just set DisINTx, clear MSI Enable and clear MSI-X Enable
on all active PCI devices in the crashing kernel before starting the
crash kernel?  So that the crash kernel starts with a clean slate?

Guilherme's original patches from 2018 iterate over all 256 PCI buses.
That might impact boot time negatively.  The reason he has to do that
is because the crashing kernel doesn't know which devices exist and
which have interrupts enabled.  However the crashing kernel has that
information.  It should either disable interrupts itself or pass the
necessary information to the crashing kernel as setup_data or whatever.

Guilherme's patches add a "clearmsi" command line parameter.  I guess
the idea is that the parameter is always passed to the crash kernel
but the patches don't do that, which seems odd.

Thanks,

Lukas


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-15 Thread Thomas Gleixner
On Sun, Nov 15 2020 at 08:29, Eric W. Biederman wrote:
> ebied...@xmission.com (Eric W. Biederman) writes:
> For ordinary irqs you can have this with level triggered irqs
> and the kernel has code that will shutdown the irq at the ioapic
> level.  Then the kernel continues by polling the irq source.
>
> I am still missing details but my first question is can our general
> solution to screaming level triggered irqs apply?

No.

> How can edge triggered MSI irqs be screaming?
>
> Is there something we can do in enabling the APICs or IOAPICs that
> would allow this to be handled better.  My memory when we enable
> the APICs and IOAPICs we completely clear the APIC entries and so
> should be disabling sources.

Yes, but MSI has nothing to do with APIC/IOAPIC

> Is the problem perhaps that we wind up using an APIC entry that was
> previously used for the MSI interrupt as something else when we
> reprogram them?  Even with this why doesn't the generic code
> to stop screaming irqs apply here?

Again. No. The problem cannot be solved at the APIC level. The APIC is
the receiving end of MSI and has absolutely no control over it.

An MSI interrupt is a (DMA) write to the local APIC base address
0xfeex which has the target CPU and control bits encoded in the
lower bits. The written data is the vector and more control bits.

The only way to stop that is to shut it up at the PCI device level.

Assume the following situation:

  - PCI device has MSI enabled and a valid target vector assigned

  - Kernel crashes

  - Kdump kernel starts

  - PCI device raises interrupts which result in the MSI write

  - Kdump kernel enables interrupts and the pending vector is raised in
the CPU.

  - The CPU has no interrupt descriptor assigned to the vector
and does not even know where the interrupt originated from. So it
treats it like any other spurious interrupt to an unassigned vector,
emits a ratelimited message and ACKs the interrupt at the APIC.

  - PCI device behaves stupid and reraises the interrupt for whatever
reason.

  - Lather, rinse and repeat.

Unfortunately there is no way to tell the APIC "Mask vector X" and the
dump kernel does neither know which device it comes from nor does it
have enumerated PCI completely which would reset the device and shutup
the spew. Due to the interrupt storm it does not get that far.

Thanks,

tglx


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-15 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Bjorn Helgaas  writes:
>
>> [+cc Rafael for question about ACPI method for PCI host bridge reset]
>>
>> On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote:
>>> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
>>> > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
>>> >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
>>> >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>>> >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>>> >> >> doing a kexec and the device is in D0-D3hot, which should also disable
>>> >> >> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
>>> >> >> device causing the storm was in PCI_UNKNOWN state?
>>> >> >
>>> >> > That's indeed a really good question.
>>> >> 
>>> >> So we do that on kexec, but is that true when starting a kdump kernel
>>> >> from a kernel crash? I doubt it.
>>> >
>>> > Ah, right, I bet that's it, thanks.  The kdump path is basically this:
>>> >
>>> >   crash_kexec
>>> > machine_kexec
>>> >
>>> > while the usual kexec path is:
>>> >
>>> >   kernel_kexec
>>> > kernel_restart_prepare
>>> >   device_shutdown
>>> > while (!list_empty(_kset->list))
>>> >   dev->bus->shutdown
>>> > pci_device_shutdown# pci_bus_type.shutdown
>>> > machine_kexec
>>> >
>>> > So maybe we need to explore doing some or all of device_shutdown() in
>>> > the crash_kexec() path as well as in the kernel_kexec() path.
>>> 
>>> The problem is that if the machine crashed anything you try to attempt
>>> before starting the crash kernel is reducing the chance that the crash
>>> kernel actually starts.
>>
>> Right.
>>
>>> Is there something at the root bridge level which allows to tell the
>>> underlying busses to shut up, reset or go into a defined state? That
>>> might avoid chasing lists which might be already unreliable.
>>
>> Maybe we need some kind of crash_device_shutdown() that does the
>> minimal thing to protect the kdump kernel from devices.
>
> The kdump kernel does not use any memory the original kernel uses.
> Which should be a minimal and fairly robust level of protection
> until the device drivers can be loaded and get ahold of things.
>
>> The programming model for conventional PCI host bridges and PCIe Root
>> Complexes is device-specific since they're outside the PCI domain.
>> There probably *are* ways to do those things, but you would need a
>> native host bridge driver or something like an ACPI method.  I'm not
>> aware of an ACPI way to do this, but I added Rafael in case he is.
>>
>> A crash_device_shutdown() could do something at the host bridge level
>> if that's possible, or reset/disable bus mastering/disable MSI/etc on
>> individual PCI devices if necessary.
>
> Unless I am confused DMA'ing to memory that is not already in use
> is completely broken wether or not you are using the kdump kernel.

Bah.  I was confused because I had not read up-thread.

MSI mixes DMA and irqs so confusion is easy.

So the problem is screaming irqs when the kernel is booting up.
This is a fundamentally tricky problem.

For ordinary irqs you can have this with level triggered irqs
and the kernel has code that will shutdown the irq at the ioapic
level.  Then the kernel continues by polling the irq source.

I am still missing details but my first question is can our general
solution to screaming level triggered irqs apply?

How can edge triggered MSI irqs be screaming?

Is there something we can do in enabling the APICs or IOAPICs that
would allow this to be handled better.  My memory when we enable
the APICs and IOAPICs we completely clear the APIC entries and so
should be disabling sources.

Is the problem perhaps that we wind up using an APIC entry that was
previously used for the MSI interrupt as something else when we
reprogram them?  Even with this why doesn't the generic code
to stop screaming irqs apply here?

Eric


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-15 Thread Eric W. Biederman
Bjorn Helgaas  writes:

> [+cc Rafael for question about ACPI method for PCI host bridge reset]
>
> On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote:
>> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
>> > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
>> >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
>> >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>> >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>> >> >> doing a kexec and the device is in D0-D3hot, which should also disable
>> >> >> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
>> >> >> device causing the storm was in PCI_UNKNOWN state?
>> >> >
>> >> > That's indeed a really good question.
>> >> 
>> >> So we do that on kexec, but is that true when starting a kdump kernel
>> >> from a kernel crash? I doubt it.
>> >
>> > Ah, right, I bet that's it, thanks.  The kdump path is basically this:
>> >
>> >   crash_kexec
>> > machine_kexec
>> >
>> > while the usual kexec path is:
>> >
>> >   kernel_kexec
>> > kernel_restart_prepare
>> >   device_shutdown
>> > while (!list_empty(_kset->list))
>> >   dev->bus->shutdown
>> > pci_device_shutdown# pci_bus_type.shutdown
>> > machine_kexec
>> >
>> > So maybe we need to explore doing some or all of device_shutdown() in
>> > the crash_kexec() path as well as in the kernel_kexec() path.
>> 
>> The problem is that if the machine crashed anything you try to attempt
>> before starting the crash kernel is reducing the chance that the crash
>> kernel actually starts.
>
> Right.
>
>> Is there something at the root bridge level which allows to tell the
>> underlying busses to shut up, reset or go into a defined state? That
>> might avoid chasing lists which might be already unreliable.
>
> Maybe we need some kind of crash_device_shutdown() that does the
> minimal thing to protect the kdump kernel from devices.

The kdump kernel does not use any memory the original kernel uses.
Which should be a minimal and fairly robust level of protection
until the device drivers can be loaded and get ahold of things.

> The programming model for conventional PCI host bridges and PCIe Root
> Complexes is device-specific since they're outside the PCI domain.
> There probably *are* ways to do those things, but you would need a
> native host bridge driver or something like an ACPI method.  I'm not
> aware of an ACPI way to do this, but I added Rafael in case he is.
>
> A crash_device_shutdown() could do something at the host bridge level
> if that's possible, or reset/disable bus mastering/disable MSI/etc on
> individual PCI devices if necessary.

Unless I am confused DMA'ing to memory that is not already in use
is completely broken wether or not you are using the kdump kernel.

Eric


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-14 Thread Bjorn Helgaas
[+cc Rafael for question about ACPI method for PCI host bridge reset]

On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote:
> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
> > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
> >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
> >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
> >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
> >> >> doing a kexec and the device is in D0-D3hot, which should also disable
> >> >> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
> >> >> device causing the storm was in PCI_UNKNOWN state?
> >> >
> >> > That's indeed a really good question.
> >> 
> >> So we do that on kexec, but is that true when starting a kdump kernel
> >> from a kernel crash? I doubt it.
> >
> > Ah, right, I bet that's it, thanks.  The kdump path is basically this:
> >
> >   crash_kexec
> > machine_kexec
> >
> > while the usual kexec path is:
> >
> >   kernel_kexec
> > kernel_restart_prepare
> >   device_shutdown
> > while (!list_empty(_kset->list))
> >   dev->bus->shutdown
> > pci_device_shutdown# pci_bus_type.shutdown
> > machine_kexec
> >
> > So maybe we need to explore doing some or all of device_shutdown() in
> > the crash_kexec() path as well as in the kernel_kexec() path.
> 
> The problem is that if the machine crashed anything you try to attempt
> before starting the crash kernel is reducing the chance that the crash
> kernel actually starts.

Right.

> Is there something at the root bridge level which allows to tell the
> underlying busses to shut up, reset or go into a defined state? That
> might avoid chasing lists which might be already unreliable.

Maybe we need some kind of crash_device_shutdown() that does the
minimal thing to protect the kdump kernel from devices.

The programming model for conventional PCI host bridges and PCIe Root
Complexes is device-specific since they're outside the PCI domain.
There probably *are* ways to do those things, but you would need a
native host bridge driver or something like an ACPI method.  I'm not
aware of an ACPI way to do this, but I added Rafael in case he is.

A crash_device_shutdown() could do something at the host bridge level
if that's possible, or reset/disable bus mastering/disable MSI/etc on
individual PCI devices if necessary.

Bjorn


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-14 Thread Thomas Gleixner
Bjorn,

On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
> On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
>> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
>> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>> >> doing a kexec and the device is in D0-D3hot, which should also disable
>> >> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
>> >> device causing the storm was in PCI_UNKNOWN state?
>> >
>> > That's indeed a really good question.
>> 
>> So we do that on kexec, but is that true when starting a kdump kernel
>> from a kernel crash? I doubt it.
>
> Ah, right, I bet that's it, thanks.  The kdump path is basically this:
>
>   crash_kexec
> machine_kexec
>
> while the usual kexec path is:
>
>   kernel_kexec
> kernel_restart_prepare
>   device_shutdown
> while (!list_empty(_kset->list))
>   dev->bus->shutdown
> pci_device_shutdown# pci_bus_type.shutdown
> machine_kexec
>
> So maybe we need to explore doing some or all of device_shutdown() in
> the crash_kexec() path as well as in the kernel_kexec() path.

The problem is that if the machine crashed anything you try to attempt
before starting the crash kernel is reducing the chance that the crash
kernel actually starts.

Is there something at the root bridge level which allows to tell the
underlying busses to shut up, reset or go into a defined state? That
might avoid chasing lists which might be already unreliable.

Thanks,

tglx


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-14 Thread Bjorn Helgaas
On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
> Bjorn,
> 
> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
> >> doing a kexec and the device is in D0-D3hot, which should also disable
> >> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
> >> device causing the storm was in PCI_UNKNOWN state?
> >
> > That's indeed a really good question.
> 
> So we do that on kexec, but is that true when starting a kdump kernel
> from a kernel crash? I doubt it.

Ah, right, I bet that's it, thanks.  The kdump path is basically this:

  crash_kexec
machine_kexec

while the usual kexec path is:

  kernel_kexec
kernel_restart_prepare
  device_shutdown
while (!list_empty(_kset->list))
  dev->bus->shutdown
pci_device_shutdown# pci_bus_type.shutdown
machine_kexec

So maybe we need to explore doing some or all of device_shutdown() in
the crash_kexec() path as well as in the kernel_kexec() path.


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-13 Thread Thomas Gleixner
Bjorn,

On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
> On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>> doing a kexec and the device is in D0-D3hot, which should also disable
>> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
>> device causing the storm was in PCI_UNKNOWN state?
>
> That's indeed a really good question.

So we do that on kexec, but is that true when starting a kdump kernel
from a kernel crash? I doubt it.

Thanks,

tglx



Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-13 Thread Thomas Gleixner
Bjorn,

On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
> On Fri, Nov 06, 2020 at 10:14:14AM -0300, Guilherme G. Piccoli wrote:
>> On 23/10/2018 14:03, Bjorn Helgaas wrote:
> I guess Thomas' patch [2] (from thread [1]) doesn't solve this
> problem?

No. As I explained in [1] patch from [2] cannot solve it because the
patch from [2] which is what Liu was trying to solve requires that there
is a registered interrupt handler which knows how to shut up the
interrupt.

> I think [0] proposes using early_quirks() to disable MSIs at
> boot-time.  That doesn't seem like a robust solution because (a) the
> problem affects all arches but early_quirks() is x86-specific and (b)
> even on x86 early_quirks() only works for PCI segment 0 because it
> relies on the 0xCF8/0xCFC I/O ports.
>
> If I understand Thomas' email correctly, the IRQ storm occurs here:
>
>   start_kernel
> setup_arch
>   early_quirks   # x86-only
> ...
>   read_pci_config_16(num, slot, func, PCI_VENDOR_ID)
> outl(..., 0xcf8) # PCI segment 0 only
> inw(0xcfc)
> local_irq_enable
>   ...
> native_irq_enable
>   asm("sti") # <-- enable IRQ, storm occurs
>
> native_irq_enable() happens long before we discover PCI host bridges
> and run the normal PCI quirks, so those would be too late to disable
> MSIs.

Correct.

> It doesn't seem practical to disable MSIs in the kdump kernel at the
> PCI level.  I was hoping we could disable them somewhere in the IRQ
> code, e.g., at IOAPICs, but I think Thomas is saying that's not
> feasible.

MSIs are not even going near the IOAPIC and as long as the interrupt
core does not have an interrupt set up for the device is has no idea
where to look at to shut it down. Actually it does not even reach the
interrupt core. The raised vector arrives at the CPU and the low level
code sees: No handler associated, ignore it. We cannot do anything from
the low level code because all we know is that the vector was raised,
but we have absolutely zero clue where that came from. At that point the
IO-APIC interrupts are definitely not the problem because they are all
disabled.

> It seems like the only option left is to disable MSIs before the
> kexec.  We used to clear the MSI/MSI-X Enable bits in
> pci_device_shutdown(), but that broke console devices that relied on
> MSI and caused "nobody cared" warnings when the devices fell back to
> using INTx, so fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in
> pci_device_shutdown()") left them unchanged.

That might be solvable because INTx arrives at the IO-APIC and we could
mask all the INTx related IO-APIC lines, but that's icky because of
this:

> pci_device_shutdown() still clears the Bus Master Enable bit if we're
> doing a kexec and the device is in D0-D3hot, which should also disable
> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
> device causing the storm was in PCI_UNKNOWN state?

That's indeed a really good question.

Thanks,

tglx


Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-13 Thread Bjorn Helgaas
On Fri, Nov 06, 2020 at 10:14:14AM -0300, Guilherme G. Piccoli wrote:
> On 23/10/2018 14:03, Bjorn Helgaas wrote:
> > On Mon, Oct 22, 2018 at 05:35:06PM -0300, Guilherme G. Piccoli wrote:
> >> On 18/10/2018 19:15, Bjorn Helgaas wrote:
> >>> On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> >>> [...] 
> >> I understand your point, but I think this is inherently an architecture
> >> problem. No matter what solution we decide for, it'll need to be applied
> >> in early boot time, like before the PCI layer gets initialized.
> > 
> > This is the part I want to know more about.  Apparently there's some
> > event X between early_quirks() and the PCI device enumeration, and we
> > must disable MSIs before X:
> > 
> >   setup_arch()
> >   early_quirks() # arch/x86/kernel/early-quirks.c
> >   early_pci_clear_msi()
> >   ...
> >   X
> >   ...
> >   pci_scan_root_bus_bridge()
> > ...
> > DECLARE_PCI_FIXUP_EARLY  # drivers/pci/quirks.c
> > 
> > I want to know specifically what X is.  If we don't know what X is and
> > all we know is "we have to disable MSIs earlier than PCI init", then
> > we're likely to break things again in the future by changing the order
> > of disabling MSIs and whatever X is.
> 
> Hi Bjorn (and all CCed), I'm sorry to necro-bump a thread >2 years
> later, but recent discussions led to a better understanding of this 'X'
> point, thanks to Thomas!
> 
> For those that deleted this thread from their email clients, it's
> available in [0] - the summary is that we faced an IRQ storm really
> early in boot, due to a bogus PCIe device MSI behavior, when booting a
> kdump kernel. This led the machine to get stuck in the boot and we
> couldn't kdump. The solution hereby proposed is to clear MSI interrupts
> early in x86, if a parameter is provided. I don't have the reproducer
> anymore and it was pretty hard to reproduce in virtual environments.
> 
> So, about the 'X' Bjorn, in another recent thread about IRQ storms [1],
> Thomas clarified that and after a brief discussion, it seems there's no
> better way to prevent the MSI storm other than clearing the MSI
> capability early in boot. As discussed both here and in thread [1], this
> is indeed a per-architecture issue (powerpc is not subject to that, due
> to a better FW reset mechanism), so I think we still could benefit in
> having this idea implemented upstream, at least in x86 (we could expand
> to other architectures if desired, in the future).
> 
> As a "test" data point, this was implemented in Ubuntu (same 3 patches
> present in this series) for ~2 years and we haven't received bug reports
> - I'm saying that because I understand your concerns about expanding the
> early PCI quirks scope.
> 
> Let me know your thoughts. I'd suggest all to read thread [1], which
> addresses a similar issue but in a different "moment" of the system boot
> and provides some more insight on why the early MSI clearing seems to
> make sense.

I guess Thomas' patch [2] (from thread [1]) doesn't solve this
problem?

I think [0] proposes using early_quirks() to disable MSIs at
boot-time.  That doesn't seem like a robust solution because (a) the
problem affects all arches but early_quirks() is x86-specific and (b)
even on x86 early_quirks() only works for PCI segment 0 because it
relies on the 0xCF8/0xCFC I/O ports.

If I understand Thomas' email correctly, the IRQ storm occurs here:

  start_kernel
setup_arch
  early_quirks   # x86-only
...
  read_pci_config_16(num, slot, func, PCI_VENDOR_ID)
outl(..., 0xcf8) # PCI segment 0 only
inw(0xcfc)
local_irq_enable
  ...
native_irq_enable
  asm("sti") # <-- enable IRQ, storm occurs

native_irq_enable() happens long before we discover PCI host bridges
and run the normal PCI quirks, so those would be too late to disable
MSIs.

It doesn't seem practical to disable MSIs in the kdump kernel at the
PCI level.  I was hoping we could disable them somewhere in the IRQ
code, e.g., at IOAPICs, but I think Thomas is saying that's not
feasible.

It seems like the only option left is to disable MSIs before the
kexec.  We used to clear the MSI/MSI-X Enable bits in
pci_device_shutdown(), but that broke console devices that relied on
MSI and caused "nobody cared" warnings when the devices fell back to
using INTx, so fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in
pci_device_shutdown()") left them unchanged.

pci_device_shutdown() still clears the Bus Master Enable bit if we're
doing a kexec and the device is in D0-D3hot, which should also disable
MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
device causing the storm was in PCI_UNKNOWN state?

> [0] 
> https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com
> 
> [1] https://lore.kernel.org/lkml/87y2js3ghv@nanos.tec.linutronix.de

[2] 

Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-06 Thread Guilherme G. Piccoli
On 23/10/2018 14:03, Bjorn Helgaas wrote:
> On Mon, Oct 22, 2018 at 05:35:06PM -0300, Guilherme G. Piccoli wrote:
>> On 18/10/2018 19:15, Bjorn Helgaas wrote:
>>> On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
>>> [...] 
>> I understand your point, but I think this is inherently an architecture
>> problem. No matter what solution we decide for, it'll need to be applied
>> in early boot time, like before the PCI layer gets initialized.
> 
> This is the part I want to know more about.  Apparently there's some
> event X between early_quirks() and the PCI device enumeration, and we
> must disable MSIs before X:
> 
>   setup_arch()
>   early_quirks() # arch/x86/kernel/early-quirks.c
>   early_pci_clear_msi()
>   ...
>   X
>   ...
>   pci_scan_root_bus_bridge()
> ...
> DECLARE_PCI_FIXUP_EARLY  # drivers/pci/quirks.c
> 
> I want to know specifically what X is.  If we don't know what X is and
> all we know is "we have to disable MSIs earlier than PCI init", then
> we're likely to break things again in the future by changing the order
> of disabling MSIs and whatever X is.
> 
> Bjorn
> 

Hi Bjorn (and all CCed), I'm sorry to necro-bump a thread >2 years
later, but recent discussions led to a better understanding of this 'X'
point, thanks to Thomas!

For those that deleted this thread from their email clients, it's
available in [0] - the summary is that we faced an IRQ storm really
early in boot, due to a bogus PCIe device MSI behavior, when booting a
kdump kernel. This led the machine to get stuck in the boot and we
couldn't kdump. The solution hereby proposed is to clear MSI interrupts
early in x86, if a parameter is provided. I don't have the reproducer
anymore and it was pretty hard to reproduce in virtual environments.

So, about the 'X' Bjorn, in another recent thread about IRQ storms [1],
Thomas clarified that and after a brief discussion, it seems there's no
better way to prevent the MSI storm other than clearing the MSI
capability early in boot. As discussed both here and in thread [1], this
is indeed a per-architecture issue (powerpc is not subject to that, due
to a better FW reset mechanism), so I think we still could benefit in
having this idea implemented upstream, at least in x86 (we could expand
to other architectures if desired, in the future).

As a "test" data point, this was implemented in Ubuntu (same 3 patches
present in this series) for ~2 years and we haven't received bug reports
- I'm saying that because I understand your concerns about expanding the
early PCI quirks scope.

Let me know your thoughts. I'd suggest all to read thread [1], which
addresses a similar issue but in a different "moment" of the system boot
and provides some more insight on why the early MSI clearing seems to
make sense.

Thanks,


Guilherme


[0]
https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com

[1] https://lore.kernel.org/lkml/87y2js3ghv@nanos.tec.linutronix.de