Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures

2021-01-27 Thread Kenneth R. Crudup


On Wed, 27 Jan 2021, Bjorn Helgaas wrote:

> > Any new news on this? Disabling "tlp" (which just shifts the problem around
> > on my machine) shouldn't be a solution for this issue.
>
> Agreed; disabling "tlp" is a workaround but not a solution.

Actually, disabling "tlp" doesn't fix the issue; I'd tested this and if IIRC,
if I don't use tlp it doesn't prevent this from happening, it just shifts it
from breaking on hibernate cycles to suspend/resume cycles instead.

-Kenny

-- 
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Orange County CA


Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures

2021-01-27 Thread Kenneth R. Crudup


On Wed, 27 Jan 2021, Bjorn Helgaas wrote:

> Do you have a URL for your initial report that I could include in the
> revert commit log?

I don't, as I'd emailed the committers first and that was then CCed to the
mailing list, but here's what I'd sent:


Date: Fri, 25 Dec 2020 16:38:56
From: Kenneth R. Crudup 
To: vid...@nvidia.com
Cc: bhelg...@google.com
Subject: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for 
suspend/resume") causing hibernate resume
failures

I've been running Linus' master branch on my laptop (Dell XPS 13 2-in-1). With
this commit in place, after resuming from hibernate my machine is essentially
useless, with a torrent of disk I/O errors on my NVMe device (at least, and
possibly other devices affected) until a reboot.

I do use tlp to set the PCIe ASPM to "performance" on AC and "powersupersave"
on battery.

Let me know if you need more information.

-Kenny



-- 
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Orange County CA


Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures

2021-01-27 Thread Bjorn Helgaas
On Fri, Jan 22, 2021 at 12:11:08PM -0800, Kenneth R. Crudup wrote:
> > > From: Kenneth R. Crudup 
> > > I've been running Linus' master branch on my laptop (Dell XPS 13
> > > 2-in-1).  With this commit in place, after resuming from hibernate
> > > my machine is essentially useless, with a torrent of disk I/O errors
> > > on my NVMe device (at least, and possibly other devices affected)
> > > until a reboot.
> > >
> > > I do use tlp to set the PCIe ASPM to "performance" on AC and
> > > "powersupersave" on battery.
> 
> On Sun, 27 Dec 2020, Bjorn Helgaas wrote:
> 
> > Thanks a lot for the report, and sorry for the breakage.
> > 4257f7e008ea restores PCI_L1SS_CTL1, then PCI_L1SS_CTL2.  I think it
> > should do those in the reverse order, since the Enable bits are in
> > PCI_L1SS_CTL1.  It also restores L1SS state (potentially enabling
> > L1.x) before we restore the PCIe Capability (potentially enabling ASPM
> > as a whole).  Those probably should also be in the other order.
> 
> Any new news on this? Disabling "tlp" (which just shifts the problem around
> on my machine) shouldn't be a solution for this issue.

Agreed; disabling "tlp" is a workaround but not a solution.

> I'd thought it may have been tied to some of the PM regressions of the last
> week of December, but all of those have been fixed but this still remains.

I haven't seen anything yet and haven't had a chance to look into it
more myself.

We're at v5.11-rc5 already, so I guess we'll have to think about
reverting 4257f7e008ea ("PCI/ASPM: Save/restore L1SS Capability for
suspend/resume") before v5.11-final unless we can make some progress.

That would mean ASPM L1 substate configuration would be lost by a
suspend/resume, so we'd give up some power saving.  But that's better
than the regression you're seeing.

I'll tentatively queue up a revert on for-linus pending progress on a
better fix.  For some reason I can't find your initial report of the
regression.  The first thing I can find is this:

https://lore.kernel.org/linux-pci/20201228040513.GA611645@bjorn-Precision-5520/

Do you have a URL for your initial report that I could include in the
revert commit log?

Bjorn


Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures

2021-01-22 Thread Kenneth R. Crudup


> > From: Kenneth R. Crudup 
> > I've been running Linus' master branch on my laptop (Dell XPS 13
> > 2-in-1).  With this commit in place, after resuming from hibernate
> > my machine is essentially useless, with a torrent of disk I/O errors
> > on my NVMe device (at least, and possibly other devices affected)
> > until a reboot.
> >
> > I do use tlp to set the PCIe ASPM to "performance" on AC and
> > "powersupersave" on battery.

On Sun, 27 Dec 2020, Bjorn Helgaas wrote:

> Thanks a lot for the report, and sorry for the breakage.
> 4257f7e008ea restores PCI_L1SS_CTL1, then PCI_L1SS_CTL2.  I think it
> should do those in the reverse order, since the Enable bits are in
> PCI_L1SS_CTL1.  It also restores L1SS state (potentially enabling
> L1.x) before we restore the PCIe Capability (potentially enabling ASPM
> as a whole).  Those probably should also be in the other order.

Any new news on this? Disabling "tlp" (which just shifts the problem around
on my machine) shouldn't be a solution for this issue.

I'd thought it may have been tied to some of the PM regressions of the last
week of December, but all of those have been fixed but this still remains.

Thanks,

-Kenny

-- 
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Orange County CA


Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures

2020-12-29 Thread Vidya Sagar

Ideally Bjorn's patch should have worked.
Could you please collect 'sudo lspci -vv' (please don't forget to give 
sudo) with Bjorn's patch before and after hibernate?
Also, is it right to say that with policy set to "performance" there is 
no issue during hibernate/resume?


- Vidya Sagar

On 12/28/2020 12:00 PM, Kenneth R. Crudup wrote:

External email: Use caution opening links or attachments


On Sun, 27 Dec 2020, Kenneth R. Crudup wrote:


I'll try your patch after the revert and see if anything changes.


I just realized today's patch makes no sense if it's reverted, so nevermind.

 -Kenny

--
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Orange County CA



Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures

2020-12-27 Thread Kenneth R. Crudup


On Sun, 27 Dec 2020, Kenneth R. Crudup wrote:

> I'll try your patch after the revert and see if anything changes.

I just realized today's patch makes no sense if it's reverted, so nevermind.

-Kenny

-- 
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Orange County CA


Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures

2020-12-27 Thread Kenneth R. Crudup


OK, got more info:

On Sun, 27 Dec 2020, Bjorn Helgaas wrote:

> If it's convenient, can you try the patch below?  If the debug patch
> doesn't help:

>   - Are you seeing the hibernate/resume problem when on AC or on
> battery?

OK, so:

- on TLP, before your patch, it panic()s on AC, but not on battery
- on TLP, with your patch, it panic()s on battery, but NOT on AC
- if TLP is masked, it still panic()s, but I forget if AC or battery
- even if I mask TLP, with your commit in place it panic()s

>   - If you revert 4257f7e008ea, does hibernate/resume work fine?  Both
> with the tlp tweak and without?

Definitely with the revert everything works. I'll try your patch after the
revert and see if anything changes.

-Kenny

-- 
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Orange County CA


Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures

2020-12-27 Thread Kenneth R. Crudup

On Sun, 27 Dec 2020, Bjorn Helgaas wrote:

> If it's convenient, can you try the patch below?

Will do!

Also:

>   - Are you seeing the hibernate/resume problem when on AC or on
> battery?

Um, I forget :) but want to say "both". I'll try both ways and let you know.

>   - If you revert 4257f7e008ea, does hibernate/resume work fine?  Both
> with the tlp tweak and without?

Yeah, but TBH there were two other PM regressions in this -rc cycle, so
you guys are in good company :)

>   - Collect "sudo lspci -vv" output before hibernate and (if possible)
> after resume when you see the problem.

See attached.

> I guess tlp only uses /sys/module/pcie_aspm/parameters/policy, so it
> sets the same ASPM policy system-wide.

Yeah.

-Kenny

-- 
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Orange County CA00:00.0 Host bridge [0600]: Intel Corporation Device [8086:8a12] (rev 03)
Subsystem: Dell Device [1028:08b0]
Flags: bus master, fast devsel, latency 0, IOMMU group 0
Capabilities: [e0] Vendor Specific Information: Len=10 
Kernel driver in use: icl_uncore

00:02.0 VGA compatible controller [0300]: Intel Corporation Iris Plus Graphics 
G7 [8086:8a52] (rev 07) (prog-if 00 [VGA controller])
DeviceName: To Be Filled by O.E.M.
Subsystem: Dell Iris Plus Graphics G7 [1028:08b0]
Flags: bus master, fast devsel, latency 0, IRQ 201, IOMMU group 1
Memory at 603d00 (64-bit, non-prefetchable) [size=16M]
Memory at 40 (64-bit, prefetchable) [size=256M]
I/O ports at 4000 [size=64]
Expansion ROM at 000c [virtual] [disabled] [size=128K]
Capabilities: [40] Vendor Specific Information: Len=0c 
Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit-
Capabilities: [d0] Power Management version 2
Capabilities: [100] Process Address Space ID (PASID)
Capabilities: [200] Address Translation Service (ATS)
Capabilities: [300] Page Request Interface (PRI)
Kernel driver in use: i915
Kernel modules: i915

00:04.0 Signal processing controller [1180]: Intel Corporation Device 
[8086:8a03] (rev 03)
Subsystem: Dell Device [1028:08b0]
Flags: fast devsel, IRQ 16, IOMMU group 2
Memory at 603eba (64-bit, non-prefetchable) [size=64K]
Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
Capabilities: [d0] Power Management version 3
Capabilities: [e0] Vendor Specific Information: Len=0c 
Kernel driver in use: proc_thermal

00:05.0 Multimedia controller [0480]: Intel Corporation Device [8086:8a19] (rev 
03)
Subsystem: Dell Device [1028:08b0]
Flags: fast devsel, IRQ 255, IOMMU group 3
Memory at 603c00 (64-bit, non-prefetchable) [disabled] [size=16M]
Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
Capabilities: [ac] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [d0] Power Management version 3

00:07.0 PCI bridge [0604]: Intel Corporation Ice Lake Thunderbolt 3 PCI Express 
Root Port #0 [8086:8a1d] (rev 03) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 155, IOMMU group 4
Bus: primary=00, secondary=01, subordinate=2d, sec-latency=0
I/O behind bridge: 5000-6fff [size=8K]
Memory behind bridge: 7e00-8a1f [size=194M]
Prefetchable memory behind bridge: 0060-00601bff 
[size=448M]
Capabilities: [40] Express Root Port (Slot+), MSI 00
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [90] Subsystem: Device [:]
Capabilities: [a0] Power Management version 3
Capabilities: [100] Null
Capabilities: [220] Access Control Services
Capabilities: [a00] Downstream Port Containment
Kernel driver in use: pcieport

00:07.2 PCI bridge [0604]: Intel Corporation Ice Lake Thunderbolt 3 PCI Express 
Root Port #2 [8086:8a21] (rev 03) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 156, IOMMU group 5
Bus: primary=00, secondary=2e, subordinate=58, sec-latency=0
I/O behind bridge: 7000-7fff [size=4K]
Memory behind bridge: 7000-7c1f [size=194M]
Prefetchable memory behind bridge: 00602000-00603bff 
[size=448M]
Capabilities: [40] Express Root Port (Slot+), MSI 00
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [90] Subsystem: Device [:]
Capabilities: [a0] Power Management version 3
Capabilities: [100] Null
Capabilities: [220] Access Control Services
Capabilities: [a00] Downstream Port Containment
Kernel driver in use: pcieport

00:0d.0 USB controller [0c03]: Intel Corporation Ice Lake 

Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures

2020-12-27 Thread Bjorn Helgaas
> From: Kenneth R. Crudup 
> 
> I've been running Linus' master branch on my laptop (Dell XPS 13
> 2-in-1).  With this commit in place, after resuming from hibernate
> my machine is essentially useless, with a torrent of disk I/O errors
> on my NVMe device (at least, and possibly other devices affected)
> until a reboot.
> 
> I do use tlp to set the PCIe ASPM to "performance" on AC and
> "powersupersave" on battery.

Thanks a lot for the report, and sorry for the breakage.

4257f7e008ea restores PCI_L1SS_CTL1, then PCI_L1SS_CTL2.  I think it
should do those in the reverse order, since the Enable bits are in
PCI_L1SS_CTL1.  It also restores L1SS state (potentially enabling
L1.x) before we restore the PCIe Capability (potentially enabling ASPM
as a whole).  Those probably should also be in the other order.

If it's convenient, can you try the patch below?  If the debug patch
doesn't help:

  - Are you seeing the hibernate/resume problem when on AC or on
battery?

  - What if you don't use tlp?  Does hibernate/resume work fine then?
If tlp makes a difference, can you collect "sudo lspci -vv" output
with and without using tlp (before hibernate)?

  - If you revert 4257f7e008ea, does hibernate/resume work fine?  Both
with the tlp tweak and without?

  - Collect "sudo lspci -vv" output before hibernate and (if possible)
after resume when you see the problem.

I guess tlp only uses /sys/module/pcie_aspm/parameters/policy, so it
sets the same ASPM policy system-wide.

Bjorn

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b9fecc25d213..6598b5cd3154 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1665,9 +1665,8 @@ void pci_restore_state(struct pci_dev *dev)
 * LTR itself (in the PCIe capability).
 */
pci_restore_ltr_state(dev);
-   pci_restore_aspm_l1ss_state(dev);
-
pci_restore_pcie_state(dev);
+   pci_restore_aspm_l1ss_state(dev);
pci_restore_pasid_state(dev);
pci_restore_pri_state(dev);
pci_restore_ats_state(dev);
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index a08e7d6dc248..c4a99274b4bb 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -752,8 +752,8 @@ void pci_save_aspm_l1ss_state(struct pci_dev *dev)
return;
 
cap = (u32 *)_state->cap.data[0];
-   pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, cap++);
pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL2, cap++);
+   pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, cap++);
 }
 
 void pci_restore_aspm_l1ss_state(struct pci_dev *dev)
@@ -774,8 +774,8 @@ void pci_restore_aspm_l1ss_state(struct pci_dev *dev)
return;
 
cap = (u32 *)_state->cap.data[0];
-   pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, *cap++);
pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL2, *cap++);
+   pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, *cap++);
 }
 
 static void pcie_config_aspm_dev(struct pci_dev *pdev, u32 val)