Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
On Wed, 27 Jan 2021, Bjorn Helgaas wrote: > > Any new news on this? Disabling "tlp" (which just shifts the problem around > > on my machine) shouldn't be a solution for this issue. > > Agreed; disabling "tlp" is a workaround but not a solution. Actually, disabling "tlp" doesn't fix the issue; I'd tested this and if IIRC, if I don't use tlp it doesn't prevent this from happening, it just shifts it from breaking on hibernate cycles to suspend/resume cycles instead. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
On Wed, 27 Jan 2021, Bjorn Helgaas wrote: > Do you have a URL for your initial report that I could include in the > revert commit log? I don't, as I'd emailed the committers first and that was then CCed to the mailing list, but here's what I'd sent: Date: Fri, 25 Dec 2020 16:38:56 From: Kenneth R. Crudup To: vid...@nvidia.com Cc: bhelg...@google.com Subject: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures I've been running Linus' master branch on my laptop (Dell XPS 13 2-in-1). With this commit in place, after resuming from hibernate my machine is essentially useless, with a torrent of disk I/O errors on my NVMe device (at least, and possibly other devices affected) until a reboot. I do use tlp to set the PCIe ASPM to "performance" on AC and "powersupersave" on battery. Let me know if you need more information. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
On Fri, Jan 22, 2021 at 12:11:08PM -0800, Kenneth R. Crudup wrote: > > > From: Kenneth R. Crudup > > > I've been running Linus' master branch on my laptop (Dell XPS 13 > > > 2-in-1). With this commit in place, after resuming from hibernate > > > my machine is essentially useless, with a torrent of disk I/O errors > > > on my NVMe device (at least, and possibly other devices affected) > > > until a reboot. > > > > > > I do use tlp to set the PCIe ASPM to "performance" on AC and > > > "powersupersave" on battery. > > On Sun, 27 Dec 2020, Bjorn Helgaas wrote: > > > Thanks a lot for the report, and sorry for the breakage. > > 4257f7e008ea restores PCI_L1SS_CTL1, then PCI_L1SS_CTL2. I think it > > should do those in the reverse order, since the Enable bits are in > > PCI_L1SS_CTL1. It also restores L1SS state (potentially enabling > > L1.x) before we restore the PCIe Capability (potentially enabling ASPM > > as a whole). Those probably should also be in the other order. > > Any new news on this? Disabling "tlp" (which just shifts the problem around > on my machine) shouldn't be a solution for this issue. Agreed; disabling "tlp" is a workaround but not a solution. > I'd thought it may have been tied to some of the PM regressions of the last > week of December, but all of those have been fixed but this still remains. I haven't seen anything yet and haven't had a chance to look into it more myself. We're at v5.11-rc5 already, so I guess we'll have to think about reverting 4257f7e008ea ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") before v5.11-final unless we can make some progress. That would mean ASPM L1 substate configuration would be lost by a suspend/resume, so we'd give up some power saving. But that's better than the regression you're seeing. I'll tentatively queue up a revert on for-linus pending progress on a better fix. For some reason I can't find your initial report of the regression. The first thing I can find is this: https://lore.kernel.org/linux-pci/20201228040513.GA611645@bjorn-Precision-5520/ Do you have a URL for your initial report that I could include in the revert commit log? Bjorn
Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
> > From: Kenneth R. Crudup > > I've been running Linus' master branch on my laptop (Dell XPS 13 > > 2-in-1). With this commit in place, after resuming from hibernate > > my machine is essentially useless, with a torrent of disk I/O errors > > on my NVMe device (at least, and possibly other devices affected) > > until a reboot. > > > > I do use tlp to set the PCIe ASPM to "performance" on AC and > > "powersupersave" on battery. On Sun, 27 Dec 2020, Bjorn Helgaas wrote: > Thanks a lot for the report, and sorry for the breakage. > 4257f7e008ea restores PCI_L1SS_CTL1, then PCI_L1SS_CTL2. I think it > should do those in the reverse order, since the Enable bits are in > PCI_L1SS_CTL1. It also restores L1SS state (potentially enabling > L1.x) before we restore the PCIe Capability (potentially enabling ASPM > as a whole). Those probably should also be in the other order. Any new news on this? Disabling "tlp" (which just shifts the problem around on my machine) shouldn't be a solution for this issue. I'd thought it may have been tied to some of the PM regressions of the last week of December, but all of those have been fixed but this still remains. Thanks, -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
Ideally Bjorn's patch should have worked. Could you please collect 'sudo lspci -vv' (please don't forget to give sudo) with Bjorn's patch before and after hibernate? Also, is it right to say that with policy set to "performance" there is no issue during hibernate/resume? - Vidya Sagar On 12/28/2020 12:00 PM, Kenneth R. Crudup wrote: External email: Use caution opening links or attachments On Sun, 27 Dec 2020, Kenneth R. Crudup wrote: I'll try your patch after the revert and see if anything changes. I just realized today's patch makes no sense if it's reverted, so nevermind. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
On Sun, 27 Dec 2020, Kenneth R. Crudup wrote: > I'll try your patch after the revert and see if anything changes. I just realized today's patch makes no sense if it's reverted, so nevermind. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
OK, got more info: On Sun, 27 Dec 2020, Bjorn Helgaas wrote: > If it's convenient, can you try the patch below? If the debug patch > doesn't help: > - Are you seeing the hibernate/resume problem when on AC or on > battery? OK, so: - on TLP, before your patch, it panic()s on AC, but not on battery - on TLP, with your patch, it panic()s on battery, but NOT on AC - if TLP is masked, it still panic()s, but I forget if AC or battery - even if I mask TLP, with your commit in place it panic()s > - If you revert 4257f7e008ea, does hibernate/resume work fine? Both > with the tlp tweak and without? Definitely with the revert everything works. I'll try your patch after the revert and see if anything changes. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
On Sun, 27 Dec 2020, Bjorn Helgaas wrote: > If it's convenient, can you try the patch below? Will do! Also: > - Are you seeing the hibernate/resume problem when on AC or on > battery? Um, I forget :) but want to say "both". I'll try both ways and let you know. > - If you revert 4257f7e008ea, does hibernate/resume work fine? Both > with the tlp tweak and without? Yeah, but TBH there were two other PM regressions in this -rc cycle, so you guys are in good company :) > - Collect "sudo lspci -vv" output before hibernate and (if possible) > after resume when you see the problem. See attached. > I guess tlp only uses /sys/module/pcie_aspm/parameters/policy, so it > sets the same ASPM policy system-wide. Yeah. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA00:00.0 Host bridge [0600]: Intel Corporation Device [8086:8a12] (rev 03) Subsystem: Dell Device [1028:08b0] Flags: bus master, fast devsel, latency 0, IOMMU group 0 Capabilities: [e0] Vendor Specific Information: Len=10 Kernel driver in use: icl_uncore 00:02.0 VGA compatible controller [0300]: Intel Corporation Iris Plus Graphics G7 [8086:8a52] (rev 07) (prog-if 00 [VGA controller]) DeviceName: To Be Filled by O.E.M. Subsystem: Dell Iris Plus Graphics G7 [1028:08b0] Flags: bus master, fast devsel, latency 0, IRQ 201, IOMMU group 1 Memory at 603d00 (64-bit, non-prefetchable) [size=16M] Memory at 40 (64-bit, prefetchable) [size=256M] I/O ports at 4000 [size=64] Expansion ROM at 000c [virtual] [disabled] [size=128K] Capabilities: [40] Vendor Specific Information: Len=0c Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00 Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit- Capabilities: [d0] Power Management version 2 Capabilities: [100] Process Address Space ID (PASID) Capabilities: [200] Address Translation Service (ATS) Capabilities: [300] Page Request Interface (PRI) Kernel driver in use: i915 Kernel modules: i915 00:04.0 Signal processing controller [1180]: Intel Corporation Device [8086:8a03] (rev 03) Subsystem: Dell Device [1028:08b0] Flags: fast devsel, IRQ 16, IOMMU group 2 Memory at 603eba (64-bit, non-prefetchable) [size=64K] Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit- Capabilities: [d0] Power Management version 3 Capabilities: [e0] Vendor Specific Information: Len=0c Kernel driver in use: proc_thermal 00:05.0 Multimedia controller [0480]: Intel Corporation Device [8086:8a19] (rev 03) Subsystem: Dell Device [1028:08b0] Flags: fast devsel, IRQ 255, IOMMU group 3 Memory at 603c00 (64-bit, non-prefetchable) [disabled] [size=16M] Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00 Capabilities: [ac] MSI: Enable- Count=1/1 Maskable- 64bit+ Capabilities: [d0] Power Management version 3 00:07.0 PCI bridge [0604]: Intel Corporation Ice Lake Thunderbolt 3 PCI Express Root Port #0 [8086:8a1d] (rev 03) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0, IRQ 155, IOMMU group 4 Bus: primary=00, secondary=01, subordinate=2d, sec-latency=0 I/O behind bridge: 5000-6fff [size=8K] Memory behind bridge: 7e00-8a1f [size=194M] Prefetchable memory behind bridge: 0060-00601bff [size=448M] Capabilities: [40] Express Root Port (Slot+), MSI 00 Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit- Capabilities: [90] Subsystem: Device [:] Capabilities: [a0] Power Management version 3 Capabilities: [100] Null Capabilities: [220] Access Control Services Capabilities: [a00] Downstream Port Containment Kernel driver in use: pcieport 00:07.2 PCI bridge [0604]: Intel Corporation Ice Lake Thunderbolt 3 PCI Express Root Port #2 [8086:8a21] (rev 03) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0, IRQ 156, IOMMU group 5 Bus: primary=00, secondary=2e, subordinate=58, sec-latency=0 I/O behind bridge: 7000-7fff [size=4K] Memory behind bridge: 7000-7c1f [size=194M] Prefetchable memory behind bridge: 00602000-00603bff [size=448M] Capabilities: [40] Express Root Port (Slot+), MSI 00 Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit- Capabilities: [90] Subsystem: Device [:] Capabilities: [a0] Power Management version 3 Capabilities: [100] Null Capabilities: [220] Access Control Services Capabilities: [a00] Downstream Port Containment Kernel driver in use: pcieport 00:0d.0 USB controller [0c03]: Intel Corporation Ice Lake
Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
> From: Kenneth R. Crudup > > I've been running Linus' master branch on my laptop (Dell XPS 13 > 2-in-1). With this commit in place, after resuming from hibernate > my machine is essentially useless, with a torrent of disk I/O errors > on my NVMe device (at least, and possibly other devices affected) > until a reboot. > > I do use tlp to set the PCIe ASPM to "performance" on AC and > "powersupersave" on battery. Thanks a lot for the report, and sorry for the breakage. 4257f7e008ea restores PCI_L1SS_CTL1, then PCI_L1SS_CTL2. I think it should do those in the reverse order, since the Enable bits are in PCI_L1SS_CTL1. It also restores L1SS state (potentially enabling L1.x) before we restore the PCIe Capability (potentially enabling ASPM as a whole). Those probably should also be in the other order. If it's convenient, can you try the patch below? If the debug patch doesn't help: - Are you seeing the hibernate/resume problem when on AC or on battery? - What if you don't use tlp? Does hibernate/resume work fine then? If tlp makes a difference, can you collect "sudo lspci -vv" output with and without using tlp (before hibernate)? - If you revert 4257f7e008ea, does hibernate/resume work fine? Both with the tlp tweak and without? - Collect "sudo lspci -vv" output before hibernate and (if possible) after resume when you see the problem. I guess tlp only uses /sys/module/pcie_aspm/parameters/policy, so it sets the same ASPM policy system-wide. Bjorn diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index b9fecc25d213..6598b5cd3154 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1665,9 +1665,8 @@ void pci_restore_state(struct pci_dev *dev) * LTR itself (in the PCIe capability). */ pci_restore_ltr_state(dev); - pci_restore_aspm_l1ss_state(dev); - pci_restore_pcie_state(dev); + pci_restore_aspm_l1ss_state(dev); pci_restore_pasid_state(dev); pci_restore_pri_state(dev); pci_restore_ats_state(dev); diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c index a08e7d6dc248..c4a99274b4bb 100644 --- a/drivers/pci/pcie/aspm.c +++ b/drivers/pci/pcie/aspm.c @@ -752,8 +752,8 @@ void pci_save_aspm_l1ss_state(struct pci_dev *dev) return; cap = (u32 *)_state->cap.data[0]; - pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, cap++); pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL2, cap++); + pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, cap++); } void pci_restore_aspm_l1ss_state(struct pci_dev *dev) @@ -774,8 +774,8 @@ void pci_restore_aspm_l1ss_state(struct pci_dev *dev) return; cap = (u32 *)_state->cap.data[0]; - pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, *cap++); pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL2, *cap++); + pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, *cap++); } static void pcie_config_aspm_dev(struct pci_dev *pdev, u32 val)