Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
On Wed, 27 Jan 2021, Bjorn Helgaas wrote: > > Any new news on this? Disabling "tlp" (which just shifts the problem around > > on my machine) shouldn't be a solution for this issue. > > Agreed; disabling "tlp" is a workaround but not a solution. Actually, disabling "tlp" doesn't fix the issue; I'd tested this and if IIRC, if I don't use tlp it doesn't prevent this from happening, it just shifts it from breaking on hibernate cycles to suspend/resume cycles instead. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
On Wed, 27 Jan 2021, Bjorn Helgaas wrote: > Do you have a URL for your initial report that I could include in the > revert commit log? I don't, as I'd emailed the committers first and that was then CCed to the mailing list, but here's what I'd sent: Date: Fri, 25 Dec 2020 16:38:56 From: Kenneth R. Crudup To: vid...@nvidia.com Cc: bhelg...@google.com Subject: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures I've been running Linus' master branch on my laptop (Dell XPS 13 2-in-1). With this commit in place, after resuming from hibernate my machine is essentially useless, with a torrent of disk I/O errors on my NVMe device (at least, and possibly other devices affected) until a reboot. I do use tlp to set the PCIe ASPM to "performance" on AC and "powersupersave" on battery. Let me know if you need more information. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
> > From: Kenneth R. Crudup > > I've been running Linus' master branch on my laptop (Dell XPS 13 > > 2-in-1). With this commit in place, after resuming from hibernate > > my machine is essentially useless, with a torrent of disk I/O errors > > on my NVMe device (at least, and possibly other devices affected) > > until a reboot. > > > > I do use tlp to set the PCIe ASPM to "performance" on AC and > > "powersupersave" on battery. On Sun, 27 Dec 2020, Bjorn Helgaas wrote: > Thanks a lot for the report, and sorry for the breakage. > 4257f7e008ea restores PCI_L1SS_CTL1, then PCI_L1SS_CTL2. I think it > should do those in the reverse order, since the Enable bits are in > PCI_L1SS_CTL1. It also restores L1SS state (potentially enabling > L1.x) before we restore the PCIe Capability (potentially enabling ASPM > as a whole). Those probably should also be in the other order. Any new news on this? Disabling "tlp" (which just shifts the problem around on my machine) shouldn't be a solution for this issue. I'd thought it may have been tied to some of the PM regressions of the last week of December, but all of those have been fixed but this still remains. Thanks, -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
On Sun, 27 Dec 2020, Kenneth R. Crudup wrote: > I'll try your patch after the revert and see if anything changes. I just realized today's patch makes no sense if it's reverted, so nevermind. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
OK, got more info: On Sun, 27 Dec 2020, Bjorn Helgaas wrote: > If it's convenient, can you try the patch below? If the debug patch > doesn't help: > - Are you seeing the hibernate/resume problem when on AC or on > battery? OK, so: - on TLP, before your patch, it panic()s on AC, but not on battery - on TLP, with your patch, it panic()s on battery, but NOT on AC - if TLP is masked, it still panic()s, but I forget if AC or battery - even if I mask TLP, with your commit in place it panic()s > - If you revert 4257f7e008ea, does hibernate/resume work fine? Both > with the tlp tweak and without? Definitely with the revert everything works. I'll try your patch after the revert and see if anything changes. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
On Sun, 27 Dec 2020, Bjorn Helgaas wrote: > If it's convenient, can you try the patch below? Will do! Also: > - Are you seeing the hibernate/resume problem when on AC or on > battery? Um, I forget :) but want to say "both". I'll try both ways and let you know. > - If you revert 4257f7e008ea, does hibernate/resume work fine? Both > with the tlp tweak and without? Yeah, but TBH there were two other PM regressions in this -rc cycle, so you guys are in good company :) > - Collect "sudo lspci -vv" output before hibernate and (if possible) > after resume when you see the problem. See attached. > I guess tlp only uses /sys/module/pcie_aspm/parameters/policy, so it > sets the same ASPM policy system-wide. Yeah. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA00:00.0 Host bridge [0600]: Intel Corporation Device [8086:8a12] (rev 03) Subsystem: Dell Device [1028:08b0] Flags: bus master, fast devsel, latency 0, IOMMU group 0 Capabilities: [e0] Vendor Specific Information: Len=10 Kernel driver in use: icl_uncore 00:02.0 VGA compatible controller [0300]: Intel Corporation Iris Plus Graphics G7 [8086:8a52] (rev 07) (prog-if 00 [VGA controller]) DeviceName: To Be Filled by O.E.M. Subsystem: Dell Iris Plus Graphics G7 [1028:08b0] Flags: bus master, fast devsel, latency 0, IRQ 201, IOMMU group 1 Memory at 603d00 (64-bit, non-prefetchable) [size=16M] Memory at 40 (64-bit, prefetchable) [size=256M] I/O ports at 4000 [size=64] Expansion ROM at 000c [virtual] [disabled] [size=128K] Capabilities: [40] Vendor Specific Information: Len=0c Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00 Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit- Capabilities: [d0] Power Management version 2 Capabilities: [100] Process Address Space ID (PASID) Capabilities: [200] Address Translation Service (ATS) Capabilities: [300] Page Request Interface (PRI) Kernel driver in use: i915 Kernel modules: i915 00:04.0 Signal processing controller [1180]: Intel Corporation Device [8086:8a03] (rev 03) Subsystem: Dell Device [1028:08b0] Flags: fast devsel, IRQ 16, IOMMU group 2 Memory at 603eba (64-bit, non-prefetchable) [size=64K] Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit- Capabilities: [d0] Power Management version 3 Capabilities: [e0] Vendor Specific Information: Len=0c Kernel driver in use: proc_thermal 00:05.0 Multimedia controller [0480]: Intel Corporation Device [8086:8a19] (rev 03) Subsystem: Dell Device [1028:08b0] Flags: fast devsel, IRQ 255, IOMMU group 3 Memory at 603c00 (64-bit, non-prefetchable) [disabled] [size=16M] Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00 Capabilities: [ac] MSI: Enable- Count=1/1 Maskable- 64bit+ Capabilities: [d0] Power Management version 3 00:07.0 PCI bridge [0604]: Intel Corporation Ice Lake Thunderbolt 3 PCI Express Root Port #0 [8086:8a1d] (rev 03) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0, IRQ 155, IOMMU group 4 Bus: primary=00, secondary=01, subordinate=2d, sec-latency=0 I/O behind bridge: 5000-6fff [size=8K] Memory behind bridge: 7e00-8a1f [size=194M] Prefetchable memory behind bridge: 0060-00601bff [size=448M] Capabilities: [40] Express Root Port (Slot+), MSI 00 Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit- Capabilities: [90] Subsystem: Device [:] Capabilities: [a0] Power Management version 3 Capabilities: [100] Null Capabilities: [220] Access Control Services Capabilities: [a00] Downstream Port Containment Kernel driver in use: pcieport 00:07.2 PCI bridge [0604]: Intel Corporation Ice Lake Thunderbolt 3 PCI Express Root Port #2 [8086:8a21] (rev 03) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0, IRQ 156, IOMMU group 5 Bus: primary=00, secondary=2e, subordinate=58, sec-latency=0 I/O behind bridge: 7000-7fff [size=4K] Memory behind bridge: 7000-7c1f [size=194M] Prefetchable memory behind bridge: 00602000-00603bff [size=448M] Capabilities: [40] Express Root Port (Slot+), MSI 00 Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit- Capabilities: [90] Subsystem: Device [:] Capabilities: [a0] Power Management version 3 Capabilities: [100] Null Capabilities: [220] Access Control Services Capabilities: [a00] Downstream Port Containment Kernel driver in use: pcieport 00:0d.0 US
Re: [PATCH] tpm: efi: Don't create binary_bios_measurements file for an empty log
On Wed, 28 Oct 2020, Tyler Hicks wrote: > Hi Kai, Kenneth, and Mimi - could one or two of you please follow these > steps: ... OK, I built a kernel with the two patches applied, and with and without commit 7f3d176f5f7e, and when I run "sudo fwupdtpmevlog" (which normally causes a panic() when 7f3d176f5f7e is in place) and the results are: - With 7f3d176f5f7e reverted, no WARNs of any kind - With 7f3d176f5f7e in place, same, I just get "UEFI TPM log area empty" -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: [Regression] "tpm: Require that all digests are present in TCG_PCR_EVENT2 structures" causes null pointer dereference
On Mon, 28 Sep 2020, Jarkko Sakkinen wrote: > That's a good guess. Just a bit confused how that particular patch can > have the effect: it has two deferences to efispecid instead of one in > the same statement. Would be interesting to hear if the bug is triggered > in Kenneth's environment by the exact same commit. I can confirm that reverting 7f3d176f5f stops my kernel from panic()ing: $ sudo fwupdtpmevlog [sudo] password for kenny: Failed to parse file: attempted to read 0x10 bytes from buffer of 0x00 $ It would normally OOPS at that point. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: [PATCH] ACPI: OSL: Prevent acpi_release_memory() from returning too early
Works here for me: $ l /sys/class/typec total 0 10012 0 drwxr-xr-x 2 root root 0 Aug 21 11:25 ./ 10 0 drwxr-xr-x 67 root root 0 Aug 21 11:25 ../ 27771 0 lrwxrwxrwx 1 root root 0 Aug 21 11:25 port0 -> ../../devices/platform/USBC000:00/typec/port0/ 35601 0 lrwxrwxrwx 1 root root 0 Aug 21 11:25 port0-partner -> ../../devices/platform/USBC000:00/typec/port0/port0-partner/ 36686 0 lrwxrwxrwx 1 root root 0 Aug 21 11:25 port1 -> ../../devices/platform/USBC000:00/typec/port1/ $ On Fri, 21 Aug 2020, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki > > After commit 1757659d022b ("ACPI: OSL: Implement deferred unmapping > of ACPI memory") in some cases acpi_release_memory() may return > before the target memory mappings actually go away, because they > are released asynchronously now. > > Prevent it from returning prematurely by making it wait for the next > RCU grace period to elapse, for all of the RCU callbacks to complete > and for all of the scheduled work items to be flushed before > returning. > > Fixes: 1757659d022b ("ACPI: OSL: Implement deferred unmapping of ACPI memory") > Reported-by: Kenneth R. Crudup > Signed-off-by: Rafael J. Wysocki > --- > drivers/acpi/osl.c | 19 +-- > 1 file changed, 17 insertions(+), 2 deletions(-) > > Index: linux-pm/drivers/acpi/osl.c > === > --- linux-pm.orig/drivers/acpi/osl.c > +++ linux-pm/drivers/acpi/osl.c > @@ -1575,11 +1575,26 @@ static acpi_status acpi_deactivate_mem_r > acpi_status acpi_release_memory(acpi_handle handle, struct resource *res, > u32 level) > { > + acpi_status status; > + > if (!(res->flags & IORESOURCE_MEM)) > return AE_TYPE; > > - return acpi_walk_namespace(ACPI_TYPE_REGION, handle, level, > -acpi_deactivate_mem_region, NULL, res, NULL); > + status = acpi_walk_namespace(ACPI_TYPE_REGION, handle, level, > + acpi_deactivate_mem_region, NULL, > + res, NULL); > + if (ACPI_FAILURE(status)) > + return status; > + > + /* > + * Wait for all of the mappings queued up for removal by > + * acpi_deactivate_mem_region() to actually go away. > + */ > + synchronize_rcu(); > + rcu_barrier(); > + flush_scheduled_work(); > + > + return AE_OK; > } > EXPORT_SYMBOL_GPL(acpi_release_memory); > > > > > -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes
On Sun, 21 Jun 2020, Linus Torvalds wrote: > For me that patch makes no difference. So ... it looks like this original issue (random crashes if this commit was made and the ROMs thing isn't patched) is a red herring. For some time now, my Thunderbolt adapter (Lenovo 2nd-Gen TB dock) sometimes ... "acts up" on reboots/powerons with my laptop (Dell XPS 13 2-in-1) if I boot with it connected- I'll get random disconnects and/or lack of external video and/or its internal devices will drop offline. I suspect these crashes were related to the PCIe scribbling(?) somewhere(?) (even though I have IOMMU turned on). I can't track it down, and I think the issue may lie somewhere either with the Thunderbolt subsystem (my TB domain UUID gets regenerated on every boot and is always changing) or a bug in this Dell's BIOS (probably the likely culprit). ... in any case, I booted a fresh branch that does contain the commit in the Subject: line, and w/o Christoph's ROMs patch and as long as I don't boot with the TB adapter attached (or from a cold boot- SOMEtimes) it seems to work. So, my apologies for wasting everyone's time. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes
On Sat, 20 Jun 2020, Linus Torvalds wrote: > So yeah, I'd like to see what the code generation difference is for > you, since it seems to matter to your install for some odd reason. I could send up the .o if you'd think it would be helpful. LMK. (Eh, I'll just do it, and the .cmd as well). -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA .probe_roms.o.cmd Description: Binary data probe_roms.o Description: application/object
Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes
On Sat, Jun 20, 2020 at 6:46 AM Kenneth R. Crudup wrote: > > So, be totally surprised :) I've just booted with "maccess: rename > > probe_kernel_address to get_kernel_nofault" intact and your probe_roms.c > > patch with no issues. > > (Perhaps there's some sort of compiler optimization going on?) On Sat, 20 Jun 2020, Linus Torvalds wrote: > I'm staring at that opatch and not seeing how it could _possibly_ make > any difference in code generation. > Which is the obvious next step: would you mind compiling that file > with and without the patch and sending me the two object files? It looks like you had already, do you still need me to do this? FWIW, here's my gcc info: $ gcc --version gcc (Ubuntu 9.3.0-13ubuntu1) 9.3.0 OH- I did change arch/x86/Makefile in my own builds- maybe this could matter? Doubtful, but I could test later tonight or tomorrow (gotta do some ;l work in the meantime). diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 00e378de8bc0..37aff76f3067 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -123,7 +123,8 @@ else cflags-$(CONFIG_MPSC) += $(call cc-option,-march=nocona) cflags-$(CONFIG_MCORE2) += \ -$(call cc-option,-march=core2,$(call cc-option,-mtune=generic)) +$(call cc-option,-march=icelake-client,$(call cc-option,-mtune=native)) \ +$(call cc-option,-mtune=icelake-client,$(call cc-option,-mtune=native)) cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom) \ $(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic)) cflags-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=generic) -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes
> > Thing is, there's other examples of the previous version in the kernel > > tree- any > > chance there's a usage conflict (Thunderbolt has a ROM in it, maybe > > something in > > "probe_roms.c"? (Just guessing, no idea): On Fri, 19 Jun 2020, Christoph Hellwig wrote: > Maybe. But nothing looks strange there. Just to re-reconfirm, you had to > revert "maccess: rename probe_kernel_address to get_kernel_nofault", > "maccess: make get_kernel_nofault() check for minimal type compatibility" > wasn't enough? Yeah, the only commit I had to revert was this one. BUT: > Below is a patch to do a "partial revert" for probe_roms.c. I'd be > totally surprised if it changes anything from staring at it for while, > but anyway.. So, be totally surprised :) I've just booted with "maccess: rename probe_kernel_address to get_kernel_nofault" intact and your probe_roms.c patch with no issues. (Perhaps there's some sort of compiler optimization going on?) -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes
On Fri, 19 Jun 2020, Christoph Hellwig wrote: > Just to re-reconfirm, you had to > revert "maccess: rename probe_kernel_address to get_kernel_nofault", > "maccess: make get_kernel_nofault() check for minimal type compatibility" > wasn't enough? Yeah. the bisect turned up the commit in the subject: and reverting it (only) stopped the crashes. This was this afternoon's (US PST) Linus' master. > Below is a patch to do a "partial revert" for probe_roms.c. I'll give it a shot later, thanks. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 25f12ae45fc1 ("maccess: rename probe_kernel_address to get_kernel_nofault") causing several OOPSes
On Fri, 19 Jun 2020, Christoph Hellwig wrote: > That is indeed really strange, as that commit is just a rename. > Well, Linus also added swapping of the argument order, but again it > shouldn't change much. Thing is, there's other examples of the previous version in the kernel tree- any chance there's a usage conflict (Thunderbolt has a ROM in it, maybe something in "probe_roms.c"? (Just guessing, no idea): afind probe_kernel_address ./lib/test_lockup.c:probe_kernel_address(ptr, buf) || ./lib/test_lockup.c:probe_kernel_address(ptr + size - 1, buf)) { ./lib/test_lockup.c:if (probe_kernel_address(ptr, magic) || magic != expected) { ./arch/arm64/kernel/traps.c:if (probe_kernel_address((__force __le32 *)pc, instr_le)) ./arch/sh/kernel/traps.c: if (probe_kernel_address((insn_size_t *)addr, opcode)) ./arch/x86/kernel/traps.c: if (probe_kernel_address((unsigned short *)addr, ud)) ./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom_list, device) != 0) ./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom + 0x18, offset) != 0) ./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom + offset + 0x4, vendor) != 0) ./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom + offset + 0x6, device) != 0) ./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom + offset + 0x8, list) == 0 && ./arch/x86/kernel/probe_roms.c: probe_kernel_address(rom + offset + 0xc, rev) == 0 && ./arch/x86/kernel/probe_roms.c: return probe_kernel_address(ptr, sig) == 0 && sig == ROMSIGNATURE; ./arch/x86/kernel/probe_roms.c: for (sum = 0; length && probe_kernel_address(rom++, c) == 0; length--) ./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom + 2, c) != 0) ./arch/x86/kernel/probe_roms.c: if (probe_kernel_address(rom + 2, c) != 0) ./arch/x86/mm/fault.c: if (probe_kernel_address(instr, opcode)) ./arch/x86/mm/fault.c: if (probe_kernel_address(instr, opcode)) ./arch/x86/mm/fault.c: return probe_kernel_address((unsigned long *)p, dummy); ./arch/x86/pci/pcbios.c:if (probe_kernel_address(>fields.signature, sig)) ./arch/arm/mm/alignment.c: fault = probe_kernel_address(ip, instr); ./arch/arm/mm/alignment.c: fault = probe_kernel_address(ip, instr); ./arch/s390/mm/fault.c: return probe_kernel_address((unsigned long *)p, dummy); ./arch/powerpc/kernel/process.c:probe_kernel_address((const void *)pc, instr)) { ./arch/powerpc/kernel/kprobes.c:if (probe_kernel_address(addr, instr)) ./arch/powerpc/sysdev/fsl_pci.c:ret = probe_kernel_address((void *)regs->nip, inst); ./arch/riscv/kernel/kgdb.c: if (probe_kernel_address((void *)pc, op_code)) ./arch/riscv/kernel/kgdb.c: error = probe_kernel_address((void *)addr, stepped_opcode); ./arch/riscv/kernel/traps.c:if (probe_kernel_address((bug_insn_t *)pc, insn)) ./arch/riscv/kernel/traps.c:if (probe_kernel_address((bug_insn_t *)pc, insn)) > Do you see any compiler warnings or something > odd in the kernel log before the actual crash? Not that I could see, but I'll try building again later on. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Silicon Valley