[Bug 1803179] Re: System does not reliably come out of suspend

Bug Watch Updater Thu, 29 Nov 2018 01:21:44 -0800

Launchpad has imported 126 comments from the remote bug at
https://bugzilla.kernel.org/show_bug.cgi?id=156341.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2016-09-08T10:25:10+00:00 peter wrote:

Created attachment 232611
dmesg for v4.7-rc5 (triggered runtime-resume via writing "on" to (nvidia 
device)/power/control)

See also https://www.spinics.net/lists/linux-pci/msg53694.html ("Kernel
Freeze with American Megatrends BIOS") for more details (acpidump,
lspci, some analysis, etc.).

Steps to reproduce on the affected machines:

 1. Load nouveau.
 2. Wait for it to runtime suspend.
 3. Invoke 'lspci'; this resumes the Nvidia PCI device via nouveau.
(alternatively: write "on" to /sys/bus/pci/devices/0000:01:00.0/power/control)
 4. lspci never returns; a few moments later an AML_INFINITE_LOOP is reported.
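
As a rough illustration, the sysfs alternative to invoking lspci could be
scripted as follows. This is only a sketch: the helper name set_runtime_pm
is made up here, and the device address 0000:01:00.0 is the one from this
report and may differ on other machines. On an affected machine, the
commented-out call at the end is expected to hang just like lspci does.

```python
from pathlib import Path


def set_runtime_pm(device_dir, mode):
    """Write a runtime PM policy to <device_dir>/power/control.

    "on" forces the device to stay powered (resuming it if suspended);
    "auto" lets the kernel runtime-suspend it again.
    set_runtime_pm is a hypothetical helper, not an existing API.
    """
    if mode not in ("on", "auto"):
        raise ValueError("mode must be 'on' or 'auto'")
    control = Path(device_dir) / "power" / "control"
    control.write_text(mode)
    # Read the value back so the caller can confirm what was stored.
    return control.read_text().strip()


# Example (needs root; may freeze an affected machine):
# set_runtime_pm("/sys/bus/pci/devices/0000:01:00.0", "on")
```

Taking the device directory as a parameter keeps the sketch testable
against a scratch directory instead of the real sysfs tree.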

Affected machines from
https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-234494238
- Clevo P651RA (and other Clevo P6xxRx models).
- MSI GE62 Apache Pro
- Gigabyte P35V5
- Razer Blade 14" (2016)
- Dell Inspiron 7559

These *new* laptops all have a Skylake CPU (i7-6500HQ) and an Nvidia GTX
9xxM GPU. Originally it was only observed for laptops with AMI BIOSes,
but later we found a Dell laptop as well. The workaround
acpi_osi="!Windows 2015" prevents Linux from reporting Windows 10
compatibility and helps *in some cases* because the ACPI code falls back
to a different approach to power on the device (or PCIe link?).

Attached is one of the more interesting dmesg dumps which could be
obtained that shows how the system breaks down over time. (This was
v4.7-rc5 with PCI/PM D3cold + nouveau power resource/PM refcount leaks
patches, but the problem was also visible on unpatched 4.4.0 for
example.)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/0

------------------------------------------------------------------------
On 2016-09-12T10:51:42+00:00 rui.zhang wrote:

Let's focus on one platform first.
If you encounter this problem and can respond quickly, please
attach the acpidump of the platform.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/1

------------------------------------------------------------------------
On 2016-09-12T10:54:29+00:00 rui.zhang wrote:

Okay, let's focus on Clevo_P651RA first.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/2

------------------------------------------------------------------------
On 2016-09-12T10:55:33+00:00 rui.zhang wrote:

I don't see how to download the acpidump file at 
https://github.com/Lekensteyn/acpi-stuff/blob/master/dsl/Clevo_P651RA/acpidump.txt
can you please attach it here?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/3

------------------------------------------------------------------------
On 2016-09-12T12:07:08+00:00 peter wrote:

Created attachment 233091
acpidump for Clevo P651RA (BIOS 1.05.07)

You can download the file via the "Raw" link on Github. I have attached
a copy of the acpidump.

Of interest is the \_SB.PCI0.PGON method. See also this extract:
https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt#L94

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/4

------------------------------------------------------------------------
On 2016-09-19T23:49:11+00:00 lenb wrote:

Does this still fail if you use the proprietary nvidia driver?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/5

------------------------------------------------------------------------
On 2016-09-20T02:36:24+00:00 lv.zheng wrote:

Peter:
Should you first try this: attachment 239241

Rui:
Do you have a PCI contact? Can we have them look at the issue first?
From this link:
https://www.spinics.net/lists/linux-pci/msg53694.html 
Looks like a PCI power management gap if the attachment 239241 doesn't help.

Thanks
Lv

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/6

------------------------------------------------------------------------
On 2016-09-20T08:31:09+00:00 peter wrote:

(In reply to Len Brown from comment #5)
> Does this still fail if you use the proprietary nvidia driver?

I have not tried the proprietary driver, but AFAIK the blob makes no
attempt to put the device in D3 state.

(In reply to Lv Zheng from comment #6)
> Peter:
> Should you first try this: attachment 239241 [details]

I can try, but would it really help? Not all firmware have this loop and
they will just assume that the link state is correct. This is the
affected loop:

    While ((\_SB.PCI0.PEG0.LNKS < 0x07)) {
        Local0 = 0x20
        While (Local0) {
            If ((\_SB.PCI0.PEG0.LNKS < 0x07)) {
                Stall (0x64)
                Local0--
            } Else {
                Break
            }
        }

        If ((Local0 == Zero)) {
            \_SB.PCI0.PEG0.RTLK = One
            Stall (0x64)
        }
    }

In one trace I observed that the outer loop was executed 29 times, which
means a delay of about 29 * (32 * 100us + 100us) = 95.7ms.
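
The estimate can be reproduced with a few lines of Python. The constants
are read off the ASL loop (Stall (0x64) is 100 microseconds, Local0 = 0x20
is 32 retries); the function names are only illustrative. The model assumes
every inner retry fails and ignores time spent in the AML interpreter
itself, so it is a lower bound rather than a measurement.

```python
STALL_US = 0x64       # Stall (0x64): 100 microseconds per stall
INNER_RETRIES = 0x20  # Local0 = 0x20: 32 polls of LNKS per outer pass


def failed_pass_us():
    """Stall time of one outer pass in which the link never comes up."""
    # 32 stalls while polling, plus one stall after setting RTLK = One.
    return INNER_RETRIES * STALL_US + STALL_US


def total_wait_ms(outer_passes):
    """Total stall time in milliseconds for a number of failed passes."""
    return outer_passes * failed_pass_us() / 1000


print(failed_pass_us())   # 3300
print(total_wait_ms(29))  # 95.7
```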

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/7

------------------------------------------------------------------------
On 2016-09-21T05:53:35+00:00 lv.zheng wrote:

Do you mean it's already long enough (95.7ms) for this case, and waiting longer 
won't solve the issue?
I don't know, I just want to rule out the possible causes of the bug.

I'm not a PCI expert. So let me ask.
From the following AML, RTLK/LNKS belong to a PCI register space:
    OperationRegion (SANV, SystemMemory, 0x5FF9BD98, 0x0135)
    Field (SANV, AnyAcc, Lock, Preserve)
    {
        ASLB,   32, 
        IMON,   8, 
        IGDS,   8, 
        IBTT,   8, 
        IPAT,   8, 
        IPSC,   8, 
        IBIA,   8, 
        ISSC,   8, 
        IDMS,   8, 
        IF1E,   8, 
        HVCO,   8, 
        GSMI,   8, 
        PAVP,   8, 
        CADL,   8, 
        CSTE,   16, 
        NSTE,   16, 
        NDID,   8, 
        DID1,   32, 
        DID2,   32, 
        DID3,   32, 
        DID4,   32, 
        DID5,   32, 
        DID6,   32, 
        DID7,   32, 
        DID8,   32, 
        DID9,   32, 
        DIDA,   32, 
        DIDB,   32, 
        DIDC,   32, 
        DIDD,   32, 
        DIDE,   32, 
        DIDF,   32, 
        DIDX,   32, 
        NXD1,   32, 
        NXD2,   32, 
        NXD3,   32, 
        NXD4,   32, 
        NXD5,   32, 
        NXD6,   32, 
        NXD7,   32, 
        NXD8,   32, 
        NXDX,   32, 
        LIDS,   8, 
        KSV0,   32, 
        KSV1,   8, 
        BRTL,   8, 
        ALSE,   8, 
        ALAF,   8, 
        LLOW,   8, 
        LHIH,   8, 
        ALFP,   8, 
        IMTP,   8, 
        EDPV,   8, 
        SGMD,   8, 
        SGFL,   8, 
        SGGP,   8, 
        HRE0,   8, 
        HRG0,   32, 
        HRA0,   8, 
        PWE0,   8, 
        PWG0,   32, 
        PWA0,   8, 
        P1GP,   8, 
        HRE1,   8, 
        HRG1,   32, 
        HRA1,   8, 
        PWE1,   8, 
        PWG1,   32, 
        PWA1,   8, 
        P2GP,   8, 
        HRE2,   8, 
        HRG2,   32, 
        HRA2,   8, 
        PWE2,   8, 
        PWG2,   32, 
        PWA2,   8, 
        DLPW,   16, 
        DLHR,   16, 
        EECP,   8, 
        XBAS,   32, <- XBAS
        GBAS,   16, 
        NVGA,   32, 
        NVHA,   32, 
        AMDA,   32, 
        LTRX,   8, 
        OBFX,   8, 
        LTRY,   8, 
        OBFY,   8, 
        LTRZ,   8, 
        OBFZ,   8, 
        SMSL,   16, 
        SNSL,   16, 
        P0UB,   8, 
        P1UB,   8, 
        P2UB,   8, 
        PCSL,   8, 
        PBGE,   8, 
        M64B,   64, 
        M64L,   64, 
        CPEX,   32, 
        EEC1,   8, 
        EEC2,   8, 
        SBN0,   8, 
        SBN1,   8, 
        SBN2,   8, 
        M32B,   32, 
        M32L,   32, 
        P0WK,   32, 
        P1WK,   32, 
        P2WK,   32, 
        MXD1,   32, 
        MXD2,   32, 
        MXD3,   32, 
        MXD4,   32, 
        MXD5,   32, 
        MXD6,   32, 
        MXD7,   32, 
        MXD8,   32, 
        PXFD,   8, 
        EBAS,   32, 
        DGVS,   32, 
        DGVB,   32, 
        HYSS,   32
    }

        OperationRegion (RPCX, SystemMemory, Add (\XBAS, 0x8000), 0x1000)
        Field (RPCX, ByteAcc, NoLock, Preserve)
        {
            Offset (0x04), 
            CMDR,   8, 
            Offset (0x84), 
            D0ST,   2, 
            Offset (0xAA), 
            CEDR,   1, 
            Offset (0xB0), 
                ,   5, 
            RTLK,   1, <- RTLK
            Offset (0xC9), 
                ,   2, 
            LREN,   1, 
            Offset (0x216), 
            LNKS,   4, <- LNKS
        }
Can you infer what it is from the above AML?

Thanks

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/8

------------------------------------------------------------------------
On 2016-09-21T05:56:59+00:00 lv.zheng wrote:

It looks like the AML code in PGON prior to this loop should always make the 
condition true. What the platform needs to do is wait.
So IMO, the code prior to this loop is more important for root-causing this 
issue.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/9

------------------------------------------------------------------------
On 2016-09-21T12:23:08+00:00 peter wrote:

(In reply to Lv Zheng from comment #8)
> Do you mean it's already long enough (95.7ms) for this case, and waiting
> longer won't solve the issue?

That would be the theoretical delay. In practice, I have several seconds
of processing due to ACPI debug logging (ACPI_NAMESPACE, ACPI_DB_NAMES).
The logs stop after 46 seconds, maybe because I used SysRq+B for a
forced reboot (reset).

> I'm not a PCI expert. So let me ask.
> From the following AML, RTLK/LNKS belong to a PCI register space:
>     OperationRegion (SANV, SystemMemory, 0x5FF9BD98, 0x0135)
>     Field (SANV, AnyAcc, Lock, Preserve)
>     {
[snip]
> Can you infer what it is from the above AML?

XBAS is the PCIe MMIO Base Address register. I guessed that "RTLK" means
"Retrain Link" (see PCIe spec 7.8.7 Link Control Register) and that
"LNKS" means PCIe Link speed. I posted these on:

https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt
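
To make the register layout concrete, here is a small sketch of how the
RPCX fields map to addresses, assuming XBAS is the base of the standard
PCIe memory-mapped configuration space (ECAM). The base value used below
is a placeholder, not the value from this machine; the offsets and bit
positions are taken from the RPCX Field definition quoted in comment #8.

```python
def ecam_offset(bus, device, function, register=0):
    """Offset of a config-space register inside the ECAM window."""
    return (bus << 20) | (device << 15) | (function << 12) | register


# Placeholder only: the real XBAS value is read from the SANV region.
XBAS_EXAMPLE = 0xE0000000

# RPCX starts at XBAS + 0x8000, i.e. bus 0, device 1, function 0.
RPCX_BASE = XBAS_EXAMPLE + ecam_offset(0, 1, 0)

# (byte offset, first bit, width in bits) from the Field declaration.
RTLK = (0xB0, 5, 1)   # five reserved bits, then the one-bit RTLK field
LNKS = (0x216, 0, 4)  # four bits starting at byte 0x216


def extract(byte_value, first_bit, width):
    """Pull a bit field out of a single byte read from config space."""
    return (byte_value >> first_bit) & ((1 << width) - 1)


print(hex(ecam_offset(0, 1, 0)))  # 0x8000
print(hex(RPCX_BASE + RTLK[0]))   # 0xe00080b0
print(extract(0x20, *RTLK[1:]))   # 1
```

Bit 5 of the Link Control register (offset 0x10 into the PCI Express
capability) is the Retrain Link bit, which would fit the RTLK guess if the
capability starts at 0xA0 on this root port; that starting offset is an
assumption, not something confirmed in this thread.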

(In reply to Lv Zheng from comment #9)
> It looks like AML code in PGON prior than this loop should always make the
> condition true. What the platform need to do is to wait.
> So IMO, the code prior than this loop is more important for root causing
> this issue.

The loop is indeed just a consequence, the root cause is due to the
difference between invoking the "LKEN" code (problematic, see line 120
of notes.txt) and the fallback code (see line 141 of notes.txt).

However, I am quite at a loss as to why it would be so significant. Note that
I am no PCI expert either, the notes were based on the PCIe spec, ACPI
tables and lots of guesswork.

Do you need more info?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/10

------------------------------------------------------------------------
On 2016-09-22T02:43:58+00:00 lv.zheng wrote:

Let me re-assign it to Power-management category and reset the assignee
to involve more developers.

Thanks
Lv

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/11

------------------------------------------------------------------------
On 2016-09-27T00:09:41+00:00 rjw wrote:

Peter, one question: Why is this not regarded as a nouveau problem?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/12

------------------------------------------------------------------------
On 2016-09-27T09:28:34+00:00 peter wrote:

(In reply to Rafael J. Wysocki from comment #12)
> Peter, one question: Why is this not regarded as a nouveau problem?

Something changed in Windows 10 that made firmware authors write this
specific DSDT workaround. If Linux advertises itself as Windows 7 for
example, the problematic code is not triggered. (Some laptops also work
when advertising "non-Windows 10", such as Windows 8).

It could be a missing piece in the nouveau driver, but exactly how to
tackle that is not known. In a minimal module that uses the new PCI port
runtime PM ("PR3 support") introduced with v4.8, I could also trigger
the lockups.

Are you aware of changes to the policies in Windows 10 that could
explain the different methods of putting a device into D3? Timing-wise
or other API changes?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/13

------------------------------------------------------------------------
On 2016-09-28T00:31:04+00:00 rjw wrote:

On Tuesday, September 27, 2016 09:28:34 AM bugzilla-dae...@bugzilla.kernel.org 
wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=156341
> 
> --- Comment #13 from Peter Wu <pe...@lekensteyn.nl> ---
> (In reply to Rafael J. Wysocki from comment #12)
> > Peter, one question: Why is this not regarded as a nouveau problem?
> 
> Something changed in Windows 10 that made firmware authors write this
> specific
> DSDT workaround. If Linux advertises itself as Windows 7 for example, the
> problematic code is not triggered. (Some laptops also work when advertising
> "non-Windows 10", such as Windows 8).
> 
> It could be a missing piece in the nouveau driver, but exactly how to tackle
> that is not known. In a minimal module that uses the new PCI port runtime PM
> ("PR3 support") introduced with v4.8, I could also trigger the lockups.
> 
> Are you aware of changes to the policies in Windows 10 that could explain the
> different methods of putting a device into D3? Timing-wise or other APIs
> changes?

Not at the moment, but I'm going to ask around.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/14

------------------------------------------------------------------------
On 2016-09-29T21:20:00+00:00 rjw wrote:

One difference between Windows 10 and Windows 7 I know about is that
Windows 10 supports power management of PCIe ports and I bet the ASL in
comment #7 is needed to cope with that.

That PCIe ports PM appears to be different from what we're going to do
in 4.8+, though, which may be the source of the problem.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/15

------------------------------------------------------------------------
On 2016-09-29T21:59:04+00:00 peter wrote:

(In reply to Rafael J. Wysocki from comment #15)
> One difference between Windows 10 and Windows 7 I know about is that Windows
> 10 supports power management of PCIe ports and I bet the ASL in comment #7
> is needed to cope with that.
> 
> That PCIe ports PM appears to be different from what we're going to do in
> 4.8+, though, which may be the source of the problem.

The invoked ACPI methods (_ON/_OFF on the power resource) are matching between 
Linux and Windows 10. From a packet capture with WinDbg kernel debugger:
https://lekensteyn.nl/files/p651ra-acpi-debug/acpi-evals.txt

Maybe some extra modifications are needed to the PCIe registers? (No
idea, just guessing.)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/16

------------------------------------------------------------------------
On 2016-10-24T10:30:13+00:00 samm wrote:

Tested against 4.9-RC2 on Fedora 25 and the problem still exists

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/17

------------------------------------------------------------------------
On 2016-11-03T12:55:50+00:00 peter wrote:

(In reply to Rafael J. Wysocki from comment #15)
> That PCIe ports PM appears to be different from what we're going to do in
> 4.8+, though, which may be the source of the problem.

This is not the source of the problem; the issue already existed with
older kernels.

The list of affected models keeps growing, there have been reports from 
additional HP, Dell and Asus laptops. All of these have in common a Skylake CPU 
(i7-6700HQ) and some NVIDIA GPU (Maxwell cards, GTX 950M/960M/965M/970M, Quadro 
M1000M). Some of the laptops are listed at the updated list in
https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-234494238

Any idea what to look into? Patches, documentation or other possible
hints?

Failing a long-term solution, I am considering a temporary ACPI hack that 
patches the affected ACPI method to disable the conditional OSYS check:
https://github.com/Bumblebee-Project/bbswitch/issues/134#issuecomment-258117908

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/18

------------------------------------------------------------------------
On 2016-11-03T13:10:19+00:00 samm wrote:

Created attachment 243561
attachment-10037-0.html

Auto-reply: I'm out of the office at present and will be back in on the
7th, please contact syst...@infoxchange.org if you require a response.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/19

------------------------------------------------------------------------
On 2016-11-16T21:49:18+00:00 rjw wrote:

(In reply to Peter Wu from comment #18)
> (In reply to Rafael J. Wysocki from comment #15)
> > That PCIe ports PM appears to be different from what we're going to do in
> > 4.8+, though, which may be the source of the problem.
> 
> This is not the source of the problem, the issue exists before with older
> kernels.
> 
> The list of affected models keeps growing, there have been reports from
> additional HP, Dell and Asus laptops. All of these have in common a Skylake
> CPU (i7-6700HQ) and some NVIDIA GPU (Maxwell cards, GTX 950M/960M/965M/970M,
> Quadro M1000M). Some of the laptops are listed at the updated list in
> https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-234494238
> 
> Any idea what to look into? Patches, documentation or other possible hints?

You said that acpi_osi="!Windows 2015" helped in some cases.  I guess
the other cases (where it doesn't help) are Windows 10 only systems?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/20

------------------------------------------------------------------------
On 2016-11-16T21:52:49+00:00 rjw wrote:

And what if we simply avoided using ACPI PM with the affected device on
those systems?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/21

------------------------------------------------------------------------
On 2016-11-16T23:42:41+00:00 peter wrote:

> You said that acpi_osi="!Windows 2015" helped in some cases.  I guess the
> other cases (where it doesn't help) are Windows 10 only systems?

Not sure, I did not check if these systems have support for just w10
(and not 7, 8 or 8.1). Some others require acpi_osi=! acpi_osi="Windows
2009" to avoid the problematic code path in the ACPI table.

(In reply to Rafael J. Wysocki from comment #21)
> And what if we simply avoided using ACPI PM with the affected device on
> those systems?

You mean acpi=off? Avoiding runtime PM in nouveau would be sufficient but
kills battery life. One interesting observation is that turning off the
ACPI power resource (via PCIe port PM) or system sleep seems not to
trigger the issue. (Compared to using nouveau.) Maybe I'm dreaming, have
to retest this just to be sure.

Do you have tips for tracing PCI register activities? (E.g. read/write
pm regs)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/22

------------------------------------------------------------------------
On 2017-01-22T17:32:16+00:00 billybrawner wrote:

Hi everyone, I'm hoping to provide some helpful information here. I'm
affected by this bug, in that I can't login to gnome unless I either
blacklist the nouveau module or add "nouveau.runpm=0" to my kernel
parameters. I've got some files here that I hope are of use to you:

Link to laptop: https://www.newegg.com/Product/Product.aspx?Item=N82E16834234412
Link to call trace where I can't login: 
https://paste.fedoraproject.org/533827/14851039/
Tar archive with system info: 
http://wbrawner.com/files/ASUSTeK_COMPUTER_INC.-X550VX.tar.gz

For what it's worth, I can't even get to the login screen with the
proprietary nVidia drivers. Please let me know if I can otherwise be of
assistance.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/23

------------------------------------------------------------------------
On 2017-01-28T23:37:14+00:00 agronick wrote:

I just wanted to report that this issue is present on Lenovo W541 with
4.9.4-1. You can see my full bug report here for the symptoms:
https://bugzilla.opensuse.org/show_bug.cgi?id=1022443

Putting acpi_osi=! acpi_osi="Windows 2009" in the Grub boot line fixed
it.

Here are my GPUs:
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor 
Integrated Graphics Controller (rev 06)
01:00.0 VGA compatible controller: NVIDIA Corporation GK107GLM [Quadro K1100M] 
(rev a1)

cpuinfo prints:
Vendor ID: GenuineIntel
Hardware Raw: 
Brand: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
Hz Advertised: 2.8000 GHz
Hz Actual: 2.8000 GHz
Hz Advertised Raw: (2800000000, 0)
Hz Actual Raw: (2800000000, 0)
Arch: X86_64
Bits: 64
Count: 8
Raw Arch String: x86_64
L2 Cache Size: 6144 KB
L2 Cache Line Size: 0
L2 Cache Associativity: 0
Stepping: 3
Model: 60
Family: 6
Processor Type: 0
Extended Model: 0
Extended Family: 0
Flags: abm, acpi, aes, aperfmperf, apic, arat, arch_perfmon, avx, avx2, bmi1, 
bmi2, bts, clflush, cmov, constant_tsc, cx16, cx8, de, ds_cpl, dtes64, dtherm, 
dts, eagerfpu, epb, ept, erms, est, f16c, flexpriority, fma, fpu, fsgsbase, 
fxsr, ht, ida, invpcid, lahf_lm, lm, mca, mce, mmx, monitor, movbe, msr, mtrr, 
nonstop_tsc, nopl, nx, pae, pat, pbe, pcid, pclmulqdq, pdcm, pdpe1gb, pebs, 
pge, pln, pni, popcnt, pse, pse36, pts, rdrand, rdtscp, rep_good, sdbg, sep, 
smep, smx, ss, sse, sse2, sse4_1, sse4_2, ssse3, syscall, tm, tm2, tpr_shadow, 
tsc, tsc_adjust, tsc_deadline_timer, vme, vmx, vnmi, vpid, x2apic, xsave, 
xsaveopt, xtopology, xtpr

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/24

------------------------------------------------------------------------
On 2017-02-12T15:58:21+00:00 gbloisi wrote:

This issue is present on ASUS n552vw-fi056t (Core i7-6700HQ and NVIDIA
GeForce GTX 960M) with kernel 4.9.9.

Putting acpi_osi=! acpi_osi="Windows 2009" in the Grub boot line worked
around the problem, even though some functionality is lost (screen dimming
shortcuts).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/25

------------------------------------------------------------------------
On 2017-02-15T04:54:57+00:00 a.pronobis wrote:

The issue is also present on the KabyLake Dell XPS15 9560 with i7-7700HQ
and NVidia GTX1050M. It manifests itself as complete freezes if the
Intel card is used for X and the NVidia card is disabled with bumblebee.
Then, running nvidia-smi or lspci causes a freeze. The freezes do not
happen if the NVidia card is enabled using bbswitch.

Some info from lspci:

00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)
01:00.0 3D controller: NVIDIA Corporation Device 1c8d (rev a1)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/26

------------------------------------------------------------------------
On 2017-02-15T04:56:21+00:00 a.pronobis wrote:

I would like to add that acpi_osi="!Windows 2015" does not solve the
problem, while acpi_osi=! acpi_osi="Windows 2009" does (it does disable
the touchpad though).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/27

------------------------------------------------------------------------
On 2017-02-15T06:30:53+00:00 rui.zhang wrote:

@Andrzej, please attach the acpidump output of your laptop.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/28

------------------------------------------------------------------------
On 2017-02-15T07:53:25+00:00 a.pronobis wrote:

Created attachment 254763
acpidump for Dell XPS15 9560 KabyLake i7-7700HQ/GTX1050M BIOS 1.0.3

Here comes the acpidump for my system: Dell XPS15 9560 KabyLake
i7-7700HQ/GTX1050M BIOS 1.0.3

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/29

------------------------------------------------------------------------
On 2017-02-15T21:11:46+00:00 gbloisi wrote:

Created attachment 254777
acpidump for ASUS N552VW-FI056T SkyLake i7-6700HQ/GTX 960M BIOS 3.0.0

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/30

------------------------------------------------------------------------
On 2017-02-19T07:29:19+00:00 a.pronobis wrote:

Is there anything else I can do to help debug this issue?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/31

------------------------------------------------------------------------
On 2017-02-21T01:26:08+00:00 tobe.schumacher wrote:

Created attachment 254843
Patch for XPS 9560

I am facing the same problem on an XPS 9560 and had a look at the
acpidump, here the corresponding check is as follows:

If ((OSYS <= 0x07D9) || ((OSYS == 0x07DF) && (_REV == 0x05)))

So, telling the BIOS that we support ACPI Rev. 5 should be sufficient
for this model to allow powering down the Nvidia without locking up.
There is already some code which does this for other XPS and Latitude
models in drivers/acpi/blacklist.c, I extended it for the XPS 9560. I
also sent the patch to the LKML.
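
The effect of that check can be tabulated with a short Python sketch. It
assumes OSYS holds the year encoded by the newest _OSI string the OS
accepted (0x07D9 = 2009, 0x07DF = 2015) and that Linux reports _REV = 2
unless acpi_rev_override is in effect; both are assumptions about this
firmware and kernel, and the function name is made up for illustration.

```python
WINDOWS_2009 = 0x07D9  # 2009 in hex
WINDOWS_2015 = 0x07DF  # 2015 in hex


def check_passes(osys, rev):
    """Mirror of: (OSYS <= 0x07D9) || ((OSYS == 0x07DF) && (_REV == 0x05))"""
    return osys <= WINDOWS_2009 or (osys == WINDOWS_2015 and rev == 0x05)


# Label -> (OSYS, _REV) for the configurations discussed in this thread.
cases = {
    "default (Windows 2015, _REV 2)": (WINDOWS_2015, 2),
    "acpi_rev_override (Windows 2015, _REV 5)": (WINDOWS_2015, 5),
    'acpi_osi=! acpi_osi="Windows 2009"': (WINDOWS_2009, 2),
}

for label, (osys, rev) in cases.items():
    print(f"{label}: {check_passes(osys, rev)}")
```

Under these assumptions only the default configuration fails the check,
which is consistent with both workarounds avoiding the lockup on this
model.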

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/32

------------------------------------------------------------------------
On 2017-02-21T04:48:48+00:00 a.pronobis wrote:

Thanks Tobias! I tried your patch against 4.10 kernel and indeed it does
solve the freeze on Dell XPS 9560.

I did experience problems when disabling the card on battery with TLP
on. What solved it was adding the NVidia card to TLP
RUNTIME_PM_BLACKLIST.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/33

------------------------------------------------------------------------
On 2017-02-27T22:37:51+00:00 tobe.schumacher wrote:

Update: the patch didn't get accepted, but I got the hint to try booting
with acpi_rev_override.

acpi_rev_override=5 instead of the acpi_osi stuff works for me
(currently on kernel 4.8), no lockups and the touchpad issues also seem
to be gone.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/34

------------------------------------------------------------------------
On 2017-03-01T22:17:07+00:00 bruno.n.pagani wrote:

Created attachment 255035
acpidump for HP zBook Studio G3

I’m attaching acpidump for HP zBook Studio G3. Things are likely
happening in the SSDT3 table, but can’t understand what is the problem.
Maybe something around that PEGS function…

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/35

------------------------------------------------------------------------
On 2017-03-01T23:17:31+00:00 peter wrote:

(In reply to Bruno Pagani from comment #35)
> Created attachment 255035 [details]
> acpidump for HP zBook Studio G3
> 
> I’m attaching acpidump for HP zBook Studio G3. Things are likely happening
> in the SSDT3 table, but can’t understand what is the problem. Maybe
> something around that PEGS function…

PEGS only reads from an address (should not have side-effects). The problem is 
in PGON where the "LKEN" function is somehow problematic and the fallback ACPI 
code (_OSI="Windows 2009" for your model) avoids the issue.
(BTW, I have another friend with same laptop, that workaround worked for him.)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/36

------------------------------------------------------------------------
On 2017-03-01T23:21:22+00:00 bruno.n.pagani wrote:

(In reply to Peter Wu from comment #36)
> (In reply to Bruno Pagani from comment #35)
> > Created attachment 255035 [details]
> > acpidump for HP zBook Studio G3
> > 
> > I’m attaching acpidump for HP zBook Studio G3. Things are likely happening
> > in the SSDT3 table, but can’t understand what is the problem. Maybe
> > something around that PEGS function…
> 
> PEGS only reads from an address (should not have side-effects). The problem
> is in PGON where the "LKEN" function is somehow problematic and the fallback
> ACPI code (_OSI="Windows 2009" for your model) avoids the issue.
> (BTW, I have another friend with same laptop, that workaround worked for
> him.)

Thanks Peter!

However _OSI="Windows 2009" must have downsides, right? Would it be only
Nvidia card PCIe port PM or something in the like.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/37

------------------------------------------------------------------------
On 2017-03-02T16:39:59+00:00 peter wrote:

(In reply to Bruno Pagani from comment #37)
> However _OSI="Windows 2009" must have downsides, right? Would it be only
> Nvidia card PCIe port PM or something in the like.

Maybe hotkeys may behave differently, but I did not check this.
(The effects are model-specific, effectively you are telling the
firmware that the OS is Windows 7 which may not support firmware
features for newer Windows versions.)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/38

------------------------------------------------------------------------
On 2017-03-12T22:15:30+00:00 gbloisi wrote:

Just as an update: on my ASUS N552VW-FI056T, "acpi_rev_override=5"
works better than the acpi_osi="!Windows 2015" workaround: the lockup
disappears and the backlight function keys are working.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/39

------------------------------------------------------------------------
On 2017-03-28T12:01:46+00:00 pranav.sharma.ama wrote:

I tested out acpi_rev_override=5 on my Dell Inspiron 7559, and it got
stuck while booting. acpi_osi="!Windows 2015" is still needed for this
laptop. I am on Ubuntu 16.04 and using the nvidia drivers with prime in
intel mode.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/40

------------------------------------------------------------------------
On 2017-03-29T00:03:19+00:00 bruno.n.pagani wrote:

(In reply to Peter Wu from comment #22)
> (In reply to Rafael J. Wysocki from comment #21)
> > And what if we simply avoided using ACPI PM with the affected device on
> > those systems?
> 
> You mean acpi=off? Avoiding runtime pm nouveau would be sufficient but kills
> battery life.

That would indeed be a terrible fix.

> One interesting observation is that turning off the ACPI power
> resource (via PCIe port PM) or system sleep seems not to trigger the issue.
> (Compared to using nouveau.) Maybe I'm dreaming, have to retest this just to
> be sure.

So what I’m observing so far:

– nouveau works (doesn’t cause lockups), thanks to port PM I guess (it
doesn’t work anymore with pcie_port_pm=off → provokes a lockup at loading,
FWIR; can retest if needed). However, the power savings seem to be smaller
than before, and in particular my fan spins up and down
constantly in idle-like situations, which is really annoying.

– bbswitch with pcie_port_pm=off can turn off the card, but any attempt
to turn it ON again causes this lockup. However, power savings seem good
(but not optimal, since there is no port PM) and there is no fan spinning.
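
When comparing the nouveau and bbswitch cases, it may help to record the
runtime PM settings of the GPU before each test. The following is only a
minimal sketch: the helper names are made up for illustration, and the
device path 0000:01:00.0 is an assumption taken from the logs in this
report. It relies on the standard sysfs files power/control and
power/runtime_status.

```python
from pathlib import Path

# Assumed location of the discrete GPU; adjust for the machine under test.
GPU_SYSFS_DIR = "/sys/bus/pci/devices/0000:01:00.0"


def read_runtime_pm(device_dir=GPU_SYSFS_DIR):
    """Return (control, runtime_status) as reported by sysfs for one device."""
    power = Path(device_dir) / "power"
    control = (power / "control").read_text().strip()
    status = (power / "runtime_status").read_text().strip()
    return control, status


def may_runtime_suspend(control):
    """'auto' lets the kernel runtime-suspend the device, 'on' keeps it powered."""
    return control == "auto"
```

Note that reading these two files does not itself wake the device, unlike
lspci, which accesses the device and triggers a runtime resume.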

I still need to try using acpi_osi=! acpi_osi="Windows 2009", but this
is not normal and probably has downsides (at least this workaround has
downsides for other people).

Regarding fixing this: has there been any progress on whether the
differences in PCIe Port PM implementation between Windows and Linux
could be responsible here?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/41

------------------------------------------------------------------------
On 2017-03-29T09:52:24+00:00 remy.labene wrote:

Duplicate bug here https://bugzilla.kernel.org/show_bug.cgi?id=194431
with MSI GP72 6QE

The GRUB options acpi_osi=! acpi_osi="Windows 2009" with the nouveau
driver are the best solution at the moment.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/42

------------------------------------------------------------------------
On 2017-03-30T02:57:01+00:00 lv.zheng wrote:

(In reply to Peter Wu from comment #36)
> (In reply to Bruno Pagani from comment #35)
> > Created attachment 255035 [details]
> > acpidump for HP zBook Studio G3
> > 
> > I’m attaching acpidump for HP zBook Studio G3. Things are likely happening
> > in the SSDT3 table, but can’t understand what is the problem. Maybe
> > something around that PEGS function…
> 
> PEGS only reads from an address (should not have side-effects). The problem
> is in PGON where the "LKEN" function is somehow problematic and the fallback
> ACPI code (_OSI="Windows 2009" for your model) avoids the issue.
> (BTW, I have another friend with same laptop, that workaround worked for
> him.)

What's the problem in PGON?
Is it related to another issue you reported:
https://github.com/Bumblebee-Project/bbswitch/issues/142

Thanks
Lv

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/43

------------------------------------------------------------------------
On 2017-03-30T13:03:23+00:00 peter wrote:

(In reply to Lv Zheng from comment #43)
> (In reply to Peter Wu from comment #36)
[..]
> > PEGS only reads from an address (should not have side-effects). The problem
> > is in PGON where the "LKEN" function is somehow problematic and the
> fallback
> > ACPI code (_OSI="Windows 2009" for your model) avoids the issue.
> > (BTW, I have another friend with same laptop, that workaround worked for
> > him.)
> 
> What's the problem in PGON?
> Is it related to another issue you reported:
> https://github.com/Bumblebee-Project/bbswitch/issues/142

It is unrelated to that issue, the namespace lookup seems to work well. Here 
are related reports:
https://github.com/Bumblebee-Project/Bumblebee/issues/764
https://github.com/Bumblebee-Project/bbswitch/issues/137
https://github.com/Bumblebee-Project/bbswitch/issues/148

Any ideas where to look? If it helps, here is a summary of the executed ASL:
https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt#L171

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/44

------------------------------------------------------------------------
On 2017-04-09T07:19:20+00:00 pagetronic wrote:

Created attachment 255791
acpidump (WS72 with MS-1776 motherboard & Quadro M2000M).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/45

------------------------------------------------------------------------
On 2017-04-20T15:58:53+00:00 hhfeuer wrote:

Might be unrelated but on affected machines I more than once saw this in dmesg:
>ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
Coincidence or hint?
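
For context, the kernel prints this message when the firmware sets the
"PCIe ASPM Controls" bit in the FADT's IA-PC Boot Architecture Flags. A
small sketch of decoding that field is shown below. The bit positions
follow the ACPI specification; the function names are hypothetical, and
the flags value is assumed to have been read by hand from a disassembled
FADT (for example from one of the attached acpidumps).

```python
# Bit positions in the FADT "IA-PC Boot Architecture Flags" field.
BOOT_ARCH_FLAGS = {
    0: "LEGACY_DEVICES",
    1: "8042",
    2: "VGA_NOT_PRESENT",
    3: "MSI_NOT_SUPPORTED",
    4: "PCIE_ASPM_CONTROLS",
    5: "CMOS_RTC_NOT_PRESENT",
}


def decode_boot_arch(flags):
    """Return the names of all flags set in the 16-bit field, lowest bit first."""
    return [name for bit, name in sorted(BOOT_ARCH_FLAGS.items())
            if flags & (1 << bit)]


def firmware_forbids_aspm(flags):
    """True if bit 4 is set, i.e. the OS must not enable PCIe ASPM control."""
    return bool(flags & (1 << 4))
```

This only tells whether the firmware asked for ASPM to stay off; it says
nothing about whether that is related to the resume lockups.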

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/46

------------------------------------------------------------------------
On 2017-04-29T14:01:58+00:00 bruno.n.pagani wrote:

(In reply to Maik Freudenberg from comment #46)
> Might be unrelated but on affected machines I more than once saw this in
> dmesg:
> >ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
> Coincidence or hint?

Strange. Not sure if coincidence, but I would expect newer machines to
support ASPM. I’m having that too, but currently booting with acpi_osi=!
acpi_osi="Windows 2009", need to check without. If that changes, then
this issue is really annoying, because it means we’re losing out on power
savings to get some other ones working…

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/47

------------------------------------------------------------------------
On 2017-04-29T14:26:56+00:00 bruno.n.pagani wrote:

So just checked, and I was already having this before. Still not sure
what the implications are (here or in general).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/48

------------------------------------------------------------------------
On 2017-05-03T14:44:39+00:00 golden wrote:

In case it's helpful, on an Alienware 13 R3 running kernel 4.10.0-20,
"acpi_rev_override=5" doesn't work, but the Windows acpi option fixes
the hang.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/49

------------------------------------------------------------------------
On 2017-05-17T14:10:45+00:00 AlexanderMangaard wrote:

Created attachment 256601
i7-6700HQ CPU. Nvidia GeForce GTX965m. Clevo model N150RF.

Linux Mint 18.1 Serena, 4.11.1-041101-generic (earlier 4.4 kernel also
affected, update does nothing). Also tried latest Elementary OS and LXLE
with same problem, I thought it was related to MDM, but now I suspect
Xorg.

i7-6700HQ CPU. System hangs on lspci, lshw and inxi commands, so when
probing hardware I guess. Has problems with freezing both with Nouveau
and several versions of proprietary nvidia drivers. Laptop is a Clevo
model N150RF with Nvidia GeForce GTX965m.

dmesg also shows

INFO: task upowerd:1580 blocked for more than 120 seconds.
      Not tainted 4.11.1-041101-generic #201705140931
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
---
iwlwifi 0000:03:00.0: Getting the temperature timed out
---
INFO: task kworker/6:1:73 blocked for more than 120 seconds.
      Not tainted 4.11.1-041101-generic #201705140931
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/50

------------------------------------------------------------------------
On 2017-05-17T14:54:31+00:00 AlexanderMangaard wrote:

using boot flags acpi_osi=! "acpi_osi=Windows 2009"

the following error messages show:

nouveau 0000:01:00.0: bus: MMIO read of 00000000 FAULT at 40912c [ IBUS
]

---

nouveau 0000:01:00.0: DRM: Pointer to flat panel table invalid

---

nouveau 0000:01:00.0: priv: GPC0: 419df4 00000000 (1940820e)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/51

------------------------------------------------------------------------
On 2017-05-17T16:01:02+00:00 peter wrote:

(In reply to Alexander from comment #51)
> nouveau 0000:01:00.0: DRM: Pointer to flat panel table invalid

Not a problem, it can safely be ignored. Xorg/lspci/lshw/etc are the
victims, the cause of the freezes is due to bad interaction between PCI
or ACPI with the platform firmware. A proper solution and the exact root
cause is still unknown.
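
To see which processes end up as victims, the hung-task reports (such as
"INFO: task upowerd:1580 blocked for more than 120 seconds.") can be
pulled out of a saved dmesg. This is just an illustrative sketch; the
function name is made up, and it only matches the message format seen in
the excerpts quoted in this report.

```python
import re

# Matches e.g. "INFO: task kworker/6:1:73 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")


def find_hung_tasks(dmesg_text):
    """Return a list of (task name, pid, seconds) for each hung-task report."""
    hits = []
    for line in dmesg_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2)),
                         int(match.group(3))))
    return hits
```

The list only names the blocked tasks; it does not identify the cause.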

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/52

------------------------------------------------------------------------
On 2017-05-23T22:57:51+00:00 yaohan.chen wrote:

Created attachment 256687
acpidump for Samsung Notebook Spin 7

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/53

------------------------------------------------------------------------
On 2017-05-23T23:02:02+00:00 yaohan.chen wrote:

On a Samsung Notebook Spin 7, I am getting a black screen when resuming
from suspend, with Xorg using 100% CPU and keyboard/mouse not
responding. Setting the acpi_osi parameter did not help. I was told on
the NVIDIA forum that it has to do with this bug, so I've attached an
acpidump.

https://devtalk.nvidia.com/default/topic/1009973/linux/black-screen-and-keyboard-freeze-after-resume-from-suspend-geforce-940mx-nvidia-381-22-linux-4-11-2-/post/5152837/#5152837

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/54

------------------------------------------------------------------------
On 2017-05-26T10:44:05+00:00 robert.brocm wrote:

Created attachment 256725
MSI GP62 7RD acpidump

I'm also seeing this on an MSI GP62 7RD. The parameters 'acpi_osi=!
"acpi_osi=Windows 2009"' sort the issue for me, with no ill effect on
hotkeys, trackpad, backlight, etc.
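
Since reports in this thread quote the parameters in two different ways
(acpi_osi="Windows 2009" and "acpi_osi=Windows 2009"), it can be useful
to check what the running kernel actually received. Below is a rough
sketch that parses a command line such as the contents of /proc/cmdline.
The function name is hypothetical, and it assumes the quoting in the
string is balanced so that shell-style splitting applies.

```python
import shlex


def acpi_workarounds(cmdline):
    """Collect acpi_osi values and any acpi_rev_override from a kernel cmdline."""
    osi_values = []
    rev_override = None
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        if key == "acpi_osi" and sep:
            osi_values.append(value)
        elif key == "acpi_rev_override":
            # The option may appear with or without a value.
            rev_override = value if sep else ""
    return {"acpi_osi": osi_values, "acpi_rev_override": rev_override}
```

Both quoting styles yield the same two acpi_osi values, "!" followed by
"Windows 2009", which is what the workaround needs.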

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/55

------------------------------------------------------------------------
On 2017-06-16T15:55:35+00:00 juan.cuzmar.s wrote:

I have an Asus GL553VD, and to fix this issue I had to edit the DSDT.
I finally created a patch to override the tables and added the file to GRUB.
More information here: https://askubuntu.com/a/923216/680254

So it's the typical DSDT

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/56

------------------------------------------------------------------------
On 2017-06-16T15:58:58+00:00 juan.cuzmar.s wrote:

So it's the typical DSDT malfunction on Linux
(I'm sorry, I hit tab+enter before finish my post)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/57

------------------------------------------------------------------------
On 2017-06-17T11:12:01+00:00 remy.labene wrote:

Created attachment 257047
Dmesg on MSI GP72 6QE after Juan's modification

I tried the modification proposed by Juan on my MSI GP72 6QE. The
modified table is SSDT6. I removed the GRUB options acpi_osi=!
acpi_osi="Windows 2009"; the system is stable despite some boot errors.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/58

------------------------------------------------------------------------
On 2017-06-17T11:25:56+00:00 peter wrote:

Note that DSDT/SSDT modifications and the acpi_osi options are only
*workarounds* specific for a model. It does not solve the root cause.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/59

------------------------------------------------------------------------
On 2017-06-17T13:08:58+00:00 pagetronic wrote:

I tried the modification proposed by Juan on an MSI WS72 laptop with
MS-1776 motherboard and Quadro. The modified table is also SSDT6,
as for Remy LABENE (MSI GP72 6QE) 2017-06-17 11:12:01 UTC,

but that does not solve the boot error.

What is your motherboard model Remy LABENE ? Can't find on google.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/60

------------------------------------------------------------------------
On 2017-06-17T14:15:44+00:00 remy.labene wrote:

Created attachment 257049
dmidecode Laptop MSI GP72 6QE

See the dmi decode file for model (MS-1795)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/61

------------------------------------------------------------------------
On 2017-06-17T15:53:02+00:00 remy.labene wrote:

I upgraded to the latest BIOS (E1795IMS.119) and booted on battery only.
Some errors are still present (see below), but the system doesn't freeze
with the "nouveau" driver and 3D use.
....
[    0.717662] ACPI Error: No handler for Region [EC__] (ffff8917360f0120) [EmbeddedControl] (20160930/evregion-166)
[    0.717729] ACPI Error: Region EmbeddedControl (ID=3) has no handler (20160930/exfldio-299)
[    0.717790] ACPI Error: Method parse/execution failed [\_SB.PCI0.LPCB.EC._REG] (Node ffff8917360f1370), AE_NOT_EXIST (20160930/psparse-543)
[    0.719570] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
....
nouveau 0000:01:00.0: DRM: VRAM: 2048 MiB
[    5.829676] nouveau 0000:01:00.0: DRM: GART: 1048576 MiB
[    5.829679] nouveau 0000:01:00.0: DRM: Pointer to TMDS table invalid
[    5.829679] nouveau 0000:01:00.0: DRM: DCB version 4.0
[    5.829680] nouveau 0000:01:00.0: DRM: Pointer to flat panel table invalid

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/62

------------------------------------------------------------------------
On 2017-06-19T15:26:31+00:00 juan.cuzmar.s wrote:

(In reply to Peter Wu from comment #59)
> Note that DSDT/SSDT modifications and the acpi_osi options are only
> *workarounds* specific for a model. It does not solve the root cause.

Yeah, the problem is that the latest motherboards have a new DSDT, which
causes the firmware to not work correctly with Linux.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/63

------------------------------------------------------------------------
On 2017-06-21T05:33:21+00:00 pagetronic wrote:

Yes, I also have a sound problem that I suspect is related to
ACPI/DSDT as well (power management, crackling sound).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/64

------------------------------------------------------------------------
On 2017-07-19T23:44:58+00:00 taijian wrote:

I have another laptop that is affected by this same problem, even though
I do not have an NVIDIA dGPU. My system is an Alienware 15R3
i7-7700HQ/RX470.

Observed problems from this thread:
+ ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
+ ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
+ trying to wake the dGPU after a 'hard' suspend will crash the graphical 
session

Also, amdgpu pm does not work and the dGPU does not auto-suspend. If I
force-suspend it via acpi_call, then the graphical session will crash
upon trying to re-wake it, as described in this thread.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/65

------------------------------------------------------------------------
On 2017-07-19T23:48:07+00:00 taijian wrote:

Created attachment 257617
acpidump Alienware 15R3

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/66

------------------------------------------------------------------------
On 2017-07-19T23:48:44+00:00 taijian wrote:

Created attachment 257619
dmidecode Alienware 15R3

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/67

------------------------------------------------------------------------
On 2017-10-31T21:16:02+00:00 remy.labene wrote:

The new debian 9.1 kernel adds the "irqbalance" package. The system
seems more stable.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/68

------------------------------------------------------------------------
On 2017-10-31T21:21:01+00:00 juan.cuzmar.s wrote:

(In reply to Remy LABENE from comment #68)
> The new debian 9.1 kernel adds the "irqbalance" package. The system seems
> more stable.

Can you please run uname -arm?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/69

------------------------------------------------------------------------
On 2017-10-31T22:42:41+00:00 remy.labene wrote:

Linux PRT-MSIGP72 4.9.0-4-amd64 #1 SMP Debian 4.9.51-1 (2017-09-28)
x86_64 GNU/Linux

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/70

------------------------------------------------------------------------
On 2017-10-31T22:48:22+00:00 remy.labene wrote:

Created attachment 260453
dmesg with irqbalance daemon

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/71

------------------------------------------------------------------------
On 2017-10-31T23:06:43+00:00 bruno.n.pagani wrote:

(In reply to Remy LABENE from comment #71)
> Created attachment 260453 [details]
> dmesg with irqbalance daemon

This is a boot with `acpi_osi=! "acpi_osi=Windows 2009"
acpi_rev_override=5`, so not sure what you are trying to say.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/72

------------------------------------------------------------------------
On 2017-10-31T23:22:58+00:00 remy.labene wrote:

Yes, the system is unstable without `acpi_osi=! "acpi_osi=Windows 2009"
acpi_rev_override=5`, but now bbswitch and the nvidia driver work
perfectly, and so does the webcam.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/73

------------------------------------------------------------------------
On 2017-11-01T21:17:51+00:00 zackw wrote:

Created attachment 260457
dmesg with MS-16K2-based laptop (acpi_osi overrides in effect)

I'm also experiencing this problem with kernel 4.13 (Debian unstable) on
a shiny new laptop made by ZaReason, based on the MSI MS-16K2
motherboard (if the BIOS is to be believed, anyway, which I'm not 100%
sure of -- you'll see why when you look at the dmidecode output).
`acpi_osi=! acpi_osi="Windows 2009"` successfully works around the
problem, `acpi_rev_override=5` doesn't.

Without the workaround, logging in graphically and `lspci` after the
Nvidia chip is suspended will trigger the syndrome.  The NMI watchdog
reports a hang reading I/O registers, from inside the Nouveau power
management code.

[   26.252604] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   26.312496] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   26.312499] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   26.312501] nouveau 0000:01:00.0: DRM: resuming object tree...
[   47.206122] INFO: rcu_sched self-detected stall on CPU
[   47.206127]   7-...: (5249 ticks this GP) idle=9f6/140000000000001/0 
softirq=3582/3582 fqs=2624 
[   47.206127]    (t=5250 jiffies g=497 c=496 q=1129)
[   47.206129] NMI backtrace for cpu 7
[   47.206130] CPU: 7 PID: 611 Comm: systemd-logind Not tainted 4.13.0-1-amd64 
#1 Debian 4.13.10-1
[   47.206154] Hardware name: White Brand Company White Brand Product/MS-16K2, 
BIOS E16K2ID6.311 05/12/2017
[   47.206155] Call Trace:
[   47.206156]  <IRQ>
[   47.206159]  ? dump_stack+0x5c/0x85
[   47.206160]  ? nmi_cpu_backtrace+0xbf/0xd0
[   47.206161]  ? irq_force_complete_move+0x140/0x140
[   47.206162]  ? nmi_trigger_cpumask_backtrace+0xf4/0x120
[   47.206164]  ? rcu_dump_cpu_stacks+0x9c/0xd5
[   47.206165]  ? rcu_check_callbacks+0x7a9/0x8f0
[   47.206166]  ? update_wall_time+0x45d/0x720
[   47.206168]  ? tick_sched_do_timer+0x40/0x40
[   47.206169]  ? update_process_times+0x28/0x50
[   47.206169]  ? tick_sched_handle+0x23/0x60
[   47.206170]  ? tick_sched_timer+0x34/0x70
[   47.206171]  ? __hrtimer_run_queues+0xdc/0x220
[   47.206173]  ? hrtimer_interrupt+0xa6/0x1f0
[   47.206174]  ? smp_apic_timer_interrupt+0x34/0x50
[   47.206175]  ? apic_timer_interrupt+0x82/0x90
[   47.206175]  </IRQ>
[   47.206177]  ? ioread32+0x2b/0x30
[   47.206224]  ? nv04_timer_read+0x42/0x60 [nouveau]
[   47.206241]  ? nvkm_pmu_reset+0x67/0x160 [nouveau]
[   47.206250]  ? nvkm_subdev_preinit+0x2f/0x100 [nouveau]
[   47.206267]  ? nvkm_device_init+0x5d/0x260 [nouveau]
[   47.206282]  ? nvkm_udevice_init+0x41/0x60 [nouveau]
[   47.206291]  ? nvkm_object_init+0x3b/0x180 [nouveau]
[   47.206300]  ? nvkm_object_init+0xab/0x180 [nouveau]
[   47.206309]  ? nvkm_object_init+0xab/0x180 [nouveau]
[   47.206325]  ? nouveau_do_resume+0x3b/0xe0 [nouveau]
[   47.206341]  ? nouveau_pmops_runtime_resume+0x89/0x170 [nouveau]
[   47.206342]  ? pci_restore_standard_config+0x40/0x40
[   47.206343]  ? pci_pm_runtime_resume+0x73/0xa0
[   47.206345]  ? __rpm_callback+0xc1/0x1f0
[   47.206346]  ? pci_restore_standard_config+0x40/0x40
[   47.206347]  ? rpm_callback+0x1f/0x70
[   47.206348]  ? pci_restore_standard_config+0x40/0x40
[   47.206349]  ? rpm_resume+0x4af/0x6c0
[   47.206350]  ? evdev_ioctl_handler+0x72/0xb60 [evdev]
[   47.206352]  ? __pm_runtime_resume+0x47/0x70
[   47.206366]  ? nouveau_drm_ioctl+0x35/0xc0 [nouveau]
[   47.206368]  ? do_vfs_ioctl+0x9f/0x600
[   47.206369]  ? syscall_trace_enter+0x11a/0x2c0
[   47.206370]  ? SyS_ioctl+0x74/0x80
[   47.206371]  ? do_syscall_64+0x7c/0xf0
[   47.206372]  ? entry_SYSCALL64_slow_path+0x25/0x25
[   72.348354] watchdog: BUG: soft lockup - CPU#7 stuck for 22s! 
[systemd-logind:611]
[   72.348356] Modules linked in: ctr ccm rfcomm bnep binfmt_misc nls_ascii 
nls_cp437 vfat fat snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal 
intel_powerclamp snd_hda_codec_realtek snd_hda_codec_generic arc4 coretemp 
kvm_intel msi_wmi sparse_keymap iwlmvm kvm irqbypass rtsx_usb_ms intel_cstate 
memstick mac80211 intel_uncore intel_rapl_perf joydev evdev nouveau i915 
iwlwifi snd_hda_intel efi_pstore serio_raw snd_hda_codec pcspkr snd_hda_core 
mxm_wmi snd_hwdep uvcvideo efivars ttm videobuf2_vmalloc videobuf2_memops 
cfg80211 btusb videobuf2_v4l2 snd_pcm hci_uart drm_kms_helper btrtl 
videobuf2_core btbcm btqca btintel snd_timer drm videodev mei_me snd iTCO_wdt 
media sg iTCO_vendor_support i2c_algo_bit soundcore shpchp mei 
intel_pch_thermal bluetooth ac battery drbg ansi_cprng wmi video ecdh_generic
[   72.348404]  rfkill intel_lpss_acpi intel_lpss tpm_crb acpi_pad acpi_als 
kfifo_buf button industrialio parport_pc ppdev lp parport efivarfs ip_tables 
x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb 
algif_skcipher af_alg dm_crypt dm_mod hid_generic usbhid sd_mod rtsx_usb_sdmmc 
mmc_core rtsx_usb mfd_core crct10dif_pclmul crc32_pclmul crc32c_intel 
ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd 
psmouse ahci libahci xhci_pci i2c_i801 libata xhci_hcd nvme alx mdio nvme_core 
scsi_mod usbcore usb_common fan thermal i2c_hid hid
[   72.348435] CPU: 7 PID: 611 Comm: systemd-logind Not tainted 4.13.0-1-amd64 
#1 Debian 4.13.10-1
[   72.348436] Hardware name: White Brand Company White Brand Product/MS-16K2, 
BIOS E16K2ID6.311 05/12/2017
[   72.348436] task: ffff8ccbfb333040 task.stack: ffff9ab2021b8000
[   72.348439] RIP: 0010:ioread32+0x2b/0x30
[   72.348439] RSP: 0018:ffff9ab2021bbb80 EFLAGS: 00000296 ORIG_RAX: 
ffffffffffffff10
[   72.348440] RAX: 00000000ffffffff RBX: ffff8ccbfa67f400 RCX: 0000000000000018
[   72.348441] RDX: ffff9ab20510a014 RSI: ffff9ab20510a014 RDI: ffff9ab205009410
[   72.348441] RBP: 00000000ffffffff R08: 0000000000000002 R09: ffff9ab2021bbb84
[   72.348441] R10: 0000000000000000 R11: 00000000000003d1 R12: 00000000ffffffff
[   72.348442] R13: ffffffffffffffff R14: ffff8ccbfa35bf00 R15: ffff9ab2021bbde0
[   72.348442] FS:  00007fe0536b2a00(0000) GS:ffff8ccc0edc0000(0000) 
knlGS:0000000000000000
[   72.348443] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   72.348443] CR2: 00007fb9eb2af1a0 CR3: 0000000475345000 CR4: 00000000003406e0
[   72.348444] Call Trace:
[   72.348465]  ? nv04_timer_read+0x42/0x60 [nouveau]
[   72.348481]  ? nvkm_pmu_reset+0x67/0x160 [nouveau]
[   72.348491]  ? nvkm_subdev_preinit+0x2f/0x100 [nouveau]
[   72.348506]  ? nvkm_device_init+0x5d/0x260 [nouveau]
[   72.348521]  ? nvkm_udevice_init+0x41/0x60 [nouveau]
[   72.348530]  ? nvkm_object_init+0x3b/0x180 [nouveau]
[   72.348538]  ? nvkm_object_init+0xab/0x180 [nouveau]
[   72.348547]  ? nvkm_object_init+0xab/0x180 [nouveau]
[   72.348562]  ? nouveau_do_resume+0x3b/0xe0 [nouveau]
[   72.348577]  ? nouveau_pmops_runtime_resume+0x89/0x170 [nouveau]
[   72.348578]  ? pci_restore_standard_config+0x40/0x40
[   72.348579]  ? pci_pm_runtime_resume+0x73/0xa0
[   72.348581]  ? __rpm_callback+0xc1/0x1f0
[   72.348581]  ? pci_restore_standard_config+0x40/0x40
[   72.348582]  ? rpm_callback+0x1f/0x70
[   72.348583]  ? pci_restore_standard_config+0x40/0x40
[   72.348584]  ? rpm_resume+0x4af/0x6c0
[   72.348586]  ? evdev_ioctl_handler+0x72/0xb60 [evdev]
[   72.348587]  ? __pm_runtime_resume+0x47/0x70
[   72.348600]  ? nouveau_drm_ioctl+0x35/0xc0 [nouveau]
[   72.348602]  ? do_vfs_ioctl+0x9f/0x600
[   72.348603]  ? syscall_trace_enter+0x11a/0x2c0
[   72.348604]  ? SyS_ioctl+0x74/0x80
[   72.348605]  ? do_syscall_64+0x7c/0xf0
[   72.348606]  ? entry_SYSCALL64_slow_path+0x25/0x25
[   72.348607] Code: 48 81 ff ff ff 03 00 77 20 48 81 ff 00 00 01 00 76 05 0f 
b7 d7 ed c3 48 c7 c6 b0 b9 63 83 e8 2d ff ff ff b8 ff ff ff ff c3 8b 07 <c3> 0f 
1f 40 00 48 81 fe ff ff 03 00 48 89 f2 77 1f 48 81 fe 00 

Attached: dmesg (with acpi_osi overrides in effect).  Will shortly attach 
dmidecode and acpidump output.
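
For comparing traces from different machines, the frames can be extracted
mechanically. The sketch below is an assumption-laden illustration: the
function names are invented, and the regular expression only covers the
"? symbol+0xoff/0xlen [module]" frame format that appears in the trace
above.

```python
import re

# Matches frames like "?  nv04_timer_read+0x42/0x60 [nouveau]";
# the trailing "[module]" part is optional.
FRAME = re.compile(
    r"\?\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?")


def parse_frames(trace_text):
    """Return (function, module or None) for each call-trace frame, in order."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME.search(line)
        if match:
            frames.append((match.group(1), match.group(2)))
    return frames


def innermost_frame(trace_text, module):
    """Return the first (innermost) function belonging to the given module."""
    for func, mod in parse_frames(trace_text):
        if mod == module:
            return func
    return None
```

Applied to the soft-lockup trace above, the innermost nouveau frame is
nv04_timer_read, which fits the RIP in ioread32 with RAX reading back as
all ones, as is typical for reads from a device that is not responding.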

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/74

------------------------------------------------------------------------
On 2017-11-01T21:19:20+00:00 zackw wrote:

Created attachment 260459
dmidecode with MS-16K2-based laptop (acpi_osi overrides in effect)

dmidecode output for the same laptop described in previous comment.
Somebody forgot to fill in most of the OEM fields.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/75

------------------------------------------------------------------------
On 2017-11-01T21:19:50+00:00 zackw wrote:

Created attachment 260461
acpidump with MS-16K2-based laptop (acpi_osi overrides in effect)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/76

------------------------------------------------------------------------
On 2017-12-12T16:35:26+00:00 eurbah wrote:

Created attachment 261127
acpidump for MSI GE62 7RE-210FR

Kernel option 'acpi_osi=! acpi_osi="Windows 2009"' permits 'lspci' to
succeed, and 'nouveau' to successfully manage an external display with
resolution 3840 x 2160 at 60Hz through DisplayPort.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/77

------------------------------------------------------------------------
On 2017-12-24T14:04:18+00:00 rafadelboni wrote:

Same issue is present on Gigabyte Aero 15X (Core i7-7700HQ and NVIDIA
GeForce GTX 1070 MaxQ) with kernel 4.10.0-42-generic.

Putting acpi_osi=! acpi_osi="Windows 2009" in GRUB worked around the
problem; no major functionality is lost except the screen dimming shortcuts.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/78

------------------------------------------------------------------------
On 2018-01-29T12:50:05+00:00 bruno.n.pagani wrote:

Current status is NEEDINFO, question is “from who”? If there is anything
we could provide to help debugging, please tell us.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/79

------------------------------------------------------------------------
On 2018-02-07T03:09:28+00:00 isaac wrote:

I can confirm that I'm having this issue with my MSI GL72 6QD (i5
6300HQ, GTX 950m), and can confirm that putting acpi_osi=!
acpi_osi="Windows 2009" in grub did fix the issue.

I'd be happy to provide debug logs if needed.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/80

------------------------------------------------------------------------
On 2018-03-17T02:54:55+00:00 eurbah wrote:

With Linux kernel 4.16.0-041600rc5 from 
http://kernel.ubuntu.com/~kernel-ppa/mainline :
-  'lspci' still fails to answer, and makes the whole machine freeze after some 
time.
-  Kernel option 'acpi_osi=! acpi_osi="Windows 2009"' is still a good 
workaround.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/81

------------------------------------------------------------------------
On 2018-03-17T03:45:45+00:00 hhfeuer wrote:

(In reply to Bruno Pagani from comment #79)
> Current status is NEEDINFO, question is “from who”? If there is anything we
> could provide to help debugging, please tell us.
Maybe time to sum up the info we have
- nvidia gpu fails to power on
- on runtime resume
? also on system suspend/resume?
? due to a loop in acpi method, does overriding that alone help?
- waiting for something to come alive
little info, some questions.
"from whom" - anyone that can provide.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/82

------------------------------------------------------------------------
On 2018-03-17T03:52:18+00:00 hhfeuer wrote:

...and maybe try this first
https://bugs.acpica.org/show_bug.cgi?id=1333#c86
to rule out any side effects.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/83

------------------------------------------------------------------------
On 2018-03-27T16:37:44+00:00 intelfx wrote:

(In reply to Maik Freudenberg from comment #83)
> ...and maybe try this first
> https://bugs.acpica.org/show_bug.cgi?id=1333#c86
> first to rule out any side effects.

As expected, that did not help.

BTW, T540p (GT730M) here, similar lockups happen in 50% of
suspend/resume cycles and sometimes just before power-down/reboot.
Surprisingly though, never during actual work.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/84

------------------------------------------------------------------------
On 2018-04-18T18:43:44+00:00 eurbah wrote:

With Linux kernel 4.17.0-041700rc1 from
http://kernel.ubuntu.com/~kernel-ppa/mainline, systematically:

-  Inside a Linux console, 'lspci' fails to answer, and makes the whole
machine immediately freeze.

-  Graphical login fails, and makes the whole machine immediately
freeze.

-  Kernel option 'acpi_osi=! acpi_osi="Windows 2009"' is still a good
workaround.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/85

------------------------------------------------------------------------
On 2018-04-29T00:55:19+00:00 izolyda wrote:

I hope this bug will get fixed soon enough. Tried everything on my
Alienware 15 R3 (NVidia 1060), nothing helps. Tried also the workaround
suggested above (Kernel option 'acpi_osi=! acpi_osi="Windows 2009"'),
but it won't fix the black screen after suspend.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/86

------------------------------------------------------------------------
On 2018-07-04T15:41:08+00:00 mailtowubo wrote:

got the same issue. ThinkPad T440 NVIDIA GF117

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/87

------------------------------------------------------------------------
On 2018-07-04T15:54:06+00:00 mailtowubo wrote:

Created attachment 277165
dmesg from thinkpad t440 laptop

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/88

------------------------------------------------------------------------
On 2018-07-24T14:33:32+00:00 robbie.bowman wrote:

I get the following error trying to install Arch linux on Alienware 15

MMIO read of 00000000 FAULT at 022554 [ IBUS ]

this halts the install process and leaves me in command shell, but I
have experienced no issues running ubuntu 18.04 with its GUI.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/89

------------------------------------------------------------------------
On 2018-07-25T10:04:24+00:00 bruno.n.pagani wrote:

@Robert Bowman: Well, a command shell is the standard Arch Linux
installation method, so not sure whether you actually have an issue.

Regarding the error message you are reporting, there is already a bug
report for it: https://bugs.freedesktop.org/show_bug.cgi?id=100423

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/90

------------------------------------------------------------------------
On 2018-07-26T11:15:10+00:00 victor wrote:

Got the same issue on my HP ZBook Studio G5 x360 with a Nvidia Quadro
P1000 (Mobile) GPU using the proprietary nvidia drivers on kernel
4.17.9-1-ARCH, the workaround of setting kernel options to 'acpi_osi=!
acpi_osi="Windows 2009"' works.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/91

------------------------------------------------------------------------
On 2018-09-03T12:05:22+00:00 hhfeuer wrote:

This issue seems to be getting bigger: more recent notebooks by ASUS, 
Dell/Alienware and Toshiba seem affected. I compared those acpi tables with the 
unaffected one from my notebook and noticed this in the _ON method of the PEGP 
scope:
            TREN = One
            LNKD = Zero
            While (LNKS < 0x07)
            {

If I counted the bits correctly, this is related to the root port to which the 
nvidia gpu is connected with
TREN=link retrain bit
LNKD=link disable bit
LNKS=link width
So it sets the link to retrain, enables it, and then waits for it to reach 
x8 width. On affected machines, I couldn't find anything that sets the retrain 
bit. Is that what changed in Windows 10: always retraining, so the acpi depends 
on that?
Maybe someone could check by patching the kernel to set the bit in the resume 
iter of the pci driver?
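
For reference, a minimal sketch of the bit positions involved, assuming TREN
and LNKD map to the Retrain Link and Link Disable bits of the standard PCI
Express Link Control register as guessed above. Whether LNKS corresponds to
any Link Status field is not confirmed in this thread; the function names are
made up for illustration.

```python
# Sketch only: positions follow the standard PCI Express capability layout.
# The mapping to the ACPI names TREN/LNKD/LNKS is an assumption.

LNKCTL_LINK_DISABLE = 1 << 4   # Link Control bit 4
LNKCTL_RETRAIN_LINK = 1 << 5   # Link Control bit 5


def decode_link_status(lnksta: int) -> dict:
    """Split a 16-bit Link Status value into its main fields."""
    return {
        "speed_code": lnksta & 0xF,            # bits 3:0, current link speed
        "width": (lnksta >> 4) & 0x3F,         # bits 9:4, negotiated width
        "training": bool(lnksta & (1 << 11)),  # bit 11, training in progress
    }


def request_retrain(lnkctl: int) -> int:
    """Return a Link Control value with Link Disable cleared and Retrain set."""
    return (lnkctl & ~LNKCTL_LINK_DISABLE & 0xFFFF) | LNKCTL_RETRAIN_LINK
```

This mirrors the TREN = One / LNKD = Zero pair in the snippet: clear the
disable bit, set the retrain bit, then poll a status field.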

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/92

------------------------------------------------------------------------
On 2018-09-03T12:43:57+00:00 karolherbst wrote:

(In reply to Maik Freudenberg from comment #92)
> Since this issue seems to get bigger, more recent notebooks by ASUS,
> Dell/Alienware, Toshiba seem affected, I compared those acpi tables with the
> unaffected one from my notebook and noticed this in the _ON method of the
> PEGP scope:
>             TREN = One
>             LNKD = Zero
>             While (LNKS < 0x07)
>             {
> 
> If I counted the bits correctly, this is related to the root port to which
> the nvidia gpu is connected with
> TREN=link retrain bit
> LNKD=link disable bit
> LNKS=link width
> So it's setting the link to retrain, enables it and then waits for it to
> reach x8 width. On affected machines, I couldn't find it to set the retrain
> bit, is that what changed in Windows 10, always retraining so the acpi
> depends on that?
> Maybe someone could check by patching the kernel to set the bit in the
> resume iter of the pci driver?

okay, interesting, but I don't think this is actually causing issues.
While digging around in the nouveau code, I pinpointed the devinit
scripts we run to initialize those GPUs as the first action after which
runtime suspend+resume issues occur.

On my laptop the GPU simply doesn't respond to any PCIe request made.
Anyway, what those scripts do is to lower the PCIe link speed from 8.0
(what the GPU boots into) to 2.5 as one of the first actions.

Not doing that or setting the link to 8.0 "fixes" the issues for me:
https://github.com/karolherbst/linux/commit/08936d832bb3505d9431912d8be03796d71f55b1.patch

I would be very interested to know for how many other laptops this patch
helps as well. I noticed though that if I run the secboot bits with this
patch, resuming can still fail afterwards, but I didn't look deeper on
why that happens, as this is quite problematic to pinpoint.

But in the end, I think what is the root cause of those issues is that
the Host and the GPU simply disagree about the state of the PCIe link
and then things just randomly fail.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/93

------------------------------------------------------------------------
On 2018-09-03T12:48:07+00:00 karolherbst wrote:

Created attachment 278269
pcie link issue workaround

better to attach the patch

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/94

------------------------------------------------------------------------
On 2018-09-03T12:50:54+00:00 karolherbst wrote:

another note: in case the resume fails, the ACPI code wasn't able to
read out the GPU's state either, so most likely all values are garbage
and simply contain -1, which would mean that LNKS will never be below
0x07 anyway.

So everything I was reading out via ACPI returned 0xffff, but maybe
that's just the case for me (Dell XPS 9560).
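
A small sketch of that argument, assuming LNKS is a 4-bit field read from a
register that returns all ones when the device does not respond. The field
position used here is illustrative, not taken from any table in this thread.

```python
def field(value: int, shift: int, width: int) -> int:
    """Extract `width` bits starting at bit `shift`."""
    return (value >> shift) & ((1 << width) - 1)


def on_loop_would_wait(lnks: int, threshold: int = 0x07) -> bool:
    """Mirror of `While (LNKS < 0x07)`: True while the loop keeps spinning."""
    return lnks < threshold


dead_read = 0xFFFF              # what a read from a non-responding device gives
lnks = field(dead_read, 0, 4)   # assumed 4-bit field at an illustrative position
print(hex(lnks), on_loop_would_wait(lnks))
```

With an all-ones read every bit field is at its maximum, so a "wait until the
value is at least N" loop exits immediately instead of hanging.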

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/95

------------------------------------------------------------------------
On 2018-09-03T13:27:47+00:00 hhfeuer wrote:

If I understand correctly, that patch only applies to Pascal, any reason to 
leave Maxwell out?
In my case LNKS/TREN/LNKD worked on the config space of the upstream port, 
namely
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor 
PCI Express x8 Controller (rev 06)
and not the GPU.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/96

------------------------------------------------------------------------
On 2018-09-03T13:39:10+00:00 karolherbst wrote:

(In reply to Maik Freudenberg from comment #96)
> If I understand correctly, that patch only applies to Pascal, any reason to
> leave Maxwell out?

it also affects maxwell, but we had the PCIe stuff already wired up there, just 
not for Pascal.
The more important part of that patch is the nvkm_pcie_fini function which gets 
called before runtime suspending the GPU.

> In my case LNKS/TREN/LNKD worked on the config space of the upstream port,
> namely
> 00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor
> PCI Express x8 Controller (rev 06)
> and not the GPU.

oh, I see. But in any case, if the upstream port <-> GPU communication
is broken, we can't rely on anything. And this was the situation I
noticed on my machine.

I am mainly interested if setting the PCIe link to the "default" state
from the GPU perspective is a working workaround and my hope is, that
this will also give us better insights on what the real cause is.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/97

------------------------------------------------------------------------
On 2018-09-03T13:54:11+00:00 hhfeuer wrote:

Which kernel versions can this patch be applied on?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/98

------------------------------------------------------------------------
On 2018-09-03T14:03:08+00:00 karolherbst wrote:

(In reply to Maik Freudenberg from comment #98)
> Which kernel versions can this patch be applied on?

I think all of the most recent ones should just work? The branch the
patch is from was 4.17 based, but it may also apply on older/newer
kernels as well.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/99

------------------------------------------------------------------------
On 2018-09-20T06:00:48+00:00 josh.farwell wrote:

Karol, I tried your kernel patch and the results were very promising. No
more kernel lockups!!

My computer is a Gigabyte Aero 15X v8 with an i7-8750H and a GTX 1070. I
am running Fedora 28 with kernel version 4.18.

I have been experiencing the hard kernel lockups when nouveau is loaded
and the GPU has entered D3 power state. Running `lspci` or trying to
suspend the machine locks it up, as do other programs such as the Power
Manager in GNOME or the Steam client. Trying to unload the nouveau
driver once it has been loaded also results in a kernel lockup. I can
use the acpi_osi="Windows 2009" workaround but then the nouveau driver
seems to never put the card into the low power state.

My use case is infrequent CUDA and gaming, so my desire is to use the
proprietary drivers when I need them and turn the card off when I don't.
I am trying to use nouveau as a workaround to power off the card, as the
older methods (bbswitch) also give me kernel lockups. I am using the
current draw from tlp-stat to figure out when the card is on or off.
Luckily, it draws almost an amp(!) so it's easy to tell.

With the PCIe link speed patch applied to nouveau, the kernel lockup
issues disappear under certain conditions. If I load nouveau during boot
and run X on the Intel card, the card never turns off when it isn't in
use, and xorg-x11-drv-nouveau reports a crash after a while. However, if
I load nouveau *after* X has started up, it does power down the card and
seems to be stable.

Suspend and resume works. Unloading the nouveau module works. Running
lspci works.

I am getting some interesting results when I run lspci. Instead of a
hard kernel lockup, the lspci stops and "thinks" for a moment
corresponding with an increase in current draw. This is indicating to me
that the card is turning back on when something tries to get a response
from it. After some seconds, nouveau will turn the card off again.

I can dynamically load and unload both the nvidia module and nouveau,
which makes this a suitable workaround for me. I am curious if setting the
link speed to 8.0 would make bbswitch work, and may try it as an
experiment.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/100

------------------------------------------------------------------------
On 2018-09-20T21:44:38+00:00 mark wrote:

Josh I have also applied the "PCIE link speed patch" on the exact same
Gigabyte hardware.  I patched Kernel version 4.19 RC4.  I allow the
Nouveau module to load normally on boot and it effectively blocks the
NVIDIA driver load.  The journalctl dump seems to confirm all of this is
prior to X starting, but I do not have the subsequent crash that you
noted.  I have been able to remove the acpi_osi=! and acpi_osi="Windows
2009" options and do not have any repeatable lock-ups associated with
lspci.  I have observed the same behavior where it does pause briefly
when lspci or inxi -G is executed as if waiting on response from the
NVIDIA device.

As for power usage, I find it difficult to confirm how much power the
NVIDIA hardware is drawing and whether it is ever fully off, but I can
confirm that I see 30-40 W total draw with the NVIDIA driver in control
and only 13-20 W with nouveau.  In both cases I am just running powertop
and have allowed the post boot activities to quiet down so there is no
significant load on the system.  I don't do anything special with when I
load nouveau. I was also unable to get bbswitch to work and have removed
it and moved on.

Karol you seemed to imply this patch was more a tool to investigate as
opposed to a fix, so if there is anything I can do to help with further
diagnostics I will be happy to do so. I don't really understand this
code so I will need specifics.  I also am not seeing any adverse
behavior during suspend/resume or normal shutdown.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/101

------------------------------------------------------------------------
On 2018-09-26T08:38:40+00:00 hhfeuer wrote:

While investigating, I found an anomaly in the form of an Alienware 13 R2. This 
device also fails to properly initialize the nvidia device after resume. The 
gpu is connected to a pci bridge
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #1 
(rev f1) (prog-if 00 [Normal decode])

This was limited by the manufacturer to gen2 (5GT/s) speed:
LnkCap: Port #1, Speed 5GT/s, Width x4, ASPM not supported

Thus, this device can never reach 8GT/s so the patch doesn't have any effect.
The only noticeable difference on the gpu is:
01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2)
after boot:
LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, 
L1 <4us
after resume:
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, 
L1 <4us
The bridge stays the same, though, at LinkCap 5GT/s
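
The reasoning above (a link cannot run faster or wider than the weaker of its
two ends) can be sketched as a small parser over the LnkCap lines quoted here.
This is only an illustration; the helper names are hypothetical and the regular
expression assumes the lspci output format shown in this comment.

```python
import re

# Matches e.g. "LnkCap: Port #1, Speed 5GT/s, Width x4, ASPM not supported"
LNKCAP_RE = re.compile(r"LnkCap:.*?Speed ([\d.]+)GT/s, Width x(\d+)")


def parse_lnkcap(line: str) -> tuple:
    """Return (speed in GT/s, lane count) from one LnkCap line."""
    m = LNKCAP_RE.search(line)
    if m is None:
        raise ValueError("not a LnkCap line: %r" % line)
    return float(m.group(1)), int(m.group(2))


def link_ceiling(bridge_line: str, gpu_line: str) -> tuple:
    """Upper bound for the link: the minimum of both ends' capabilities."""
    b_speed, b_width = parse_lnkcap(bridge_line)
    g_speed, g_width = parse_lnkcap(gpu_line)
    return min(b_speed, g_speed), min(b_width, g_width)


bridge = "LnkCap: Port #1, Speed 5GT/s, Width x4, ASPM not supported"
gpu = ("LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, "
       "Exit Latency L0s <512ns, L1 <4us")
print(link_ceiling(bridge, gpu))
```

For the Alienware 13 R2 values above, the ceiling is set by the bridge
regardless of what the GPU advertises after resume.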

Yet, the acpi still contains
While (LNKS < 0x07)
    {

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/102

------------------------------------------------------------------------
On 2018-09-26T08:53:34+00:00 hhfeuer wrote:

(continued)
Yet, the acpi still contains
While (LNKS < 0x07)
    {

So the question is, what LNKS really is that needs to be 0111. On that device, 
it can be neither speed nor width?
There are also other notebooks that fail to work after resume where the gpu is 
connected to the same x4 bridge
Intel Corporation Sunrise Point-LP PCI Express Root Port
like
Asus R558U
ASUS zenbook ux310uq
Asus X705UQ
Asus x556
Alpha Centurion Ultra
MSI cx62
HP Spectre
Toshiba Satellite Pro a50
...
but on those, the bridge is not limited to gen2.
So is something really wrong with that bridge?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/103

------------------------------------------------------------------------
On 2018-09-26T08:54:44+00:00 hhfeuer wrote:

Created attachment 278771
acpidump from anomalous Alienware 13 R2

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/104

------------------------------------------------------------------------
On 2018-09-26T09:03:39+00:00 hhfeuer wrote:

Created attachment 278773
pci driver quirk to re-train link on resume

Just for completeness, attaching a hacky driver quirk patch to re-train the 
link on resume. On my otherwise unaffected notebook, this raises the link speed 
of the gpu from 0001 (2.5 GT/s) to 0003 (8GT/s) on resume.
Does not have any effect on the aforementioned Alienware, but at least it told 
that this device was already at the maximum reachable speed of 5GT/s.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/105

------------------------------------------------------------------------
On 2018-09-26T13:44:32+00:00 hhfeuer wrote:

> Asus R558U
> ASUS zenbook ux310uq
> Asus X705UQ
> Asus x556
> Alpha Centurion Ultra
> MSI cx62
> HP Spectre
> Toshiba Satellite Pro a50

Forgot, all those machines don't have any acpi switches, plain Windows
10-only machines without an acpi_osi workaround which doesn't really
improve the situation.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/106

------------------------------------------------------------------------
On 2018-09-28T23:36:34+00:00 karolherbst wrote:

(In reply to Maik Freudenberg from comment #105)
> Created attachment 278773 [details]
> pci driver quirk to re-train link on resume
> 
> Just for completeness, attaching a hacky driver quirk patch to re-train the
> link on resume. On my otherwise unaffected notebook, this raises the link
> speed of the gpu from 0001 (2.5 GT/s) to 0003 (8GT/s) on resume.
> Does not have any effect on the aforementioned Alienware, but at least it
> told that this device was already at the maximum reachable speed of 5GT/s.

yeah, this sounds correct. The thing seems to be that some GPUs actually
come up with 8GT/s, but some state saving or something makes the bus
think the device is at 2.5GT/s. Maybe even the PCI config space of the
device thinks so? Don't know for sure, but it would explain quite a lot,
because if devices disagree on the link settings, nothing is expected to
work anyway.

Will test your quirk when I find some time to test it.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/107

------------------------------------------------------------------------
On 2018-09-29T16:48:43+00:00 hhfeuer wrote:

There are currently two paths I'm trying to follow:
1) unhandled chipset bug
From my observations, the first batch of machines with this issue from back 
when this bug report was created always had their gpu connected to a pci bridge
00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root 
Port #3 [8086:a112] (rev f1)
connected gpus were different models from Maxwell to Pascal.
The second batch of machines now all have their gpu connected to a pci bridge
00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root 
Port #1 [8086:9d10] (rev f1)
connected gpus are mainly Maxwell 940MX but also 930M, 960M and even a Kepler 
mobile Quadro.
So my question to those people in this thread with @intel.com addresses, Lv 
Zheng,  Rafael J. Wysocki, are the errata sheets of those chipsets/controllers 
easily accessible to rule out a chipset bug?

2) State of config space and link on suspend
Since Karol mentioned state saving and config space, my next guess is that when 
the kernel restores the config space of bridge and gpu on resume, maybe it is 
flawed already, so nothing can really be done to get the communication right. 
That would point to having to somehow sanitize the pci config on suspend, e.g. 
tuning the speed down to 2.5GT/s if that's not already been done. This takes 
some information gathering on the state of the pci registers on suspend first.

Karol, only test the patch if you're bored, though the printed out values of 
the current speed settings might be of interest, I don't expect it to do 
anything.
If the devices disagree about the link state/setting, this should be fixable by 
triggering a link equalization though I didn't really observe that so far.

The interesting point of the mentioned Alienware is that it's nailed
down to gen2, so at least any fancy gen3 capabilities (if there are any,
didn't look into it, yet) are out of the game.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/108

------------------------------------------------------------------------
On 2018-09-30T20:22:57+00:00 karolherbst wrote:

well, you can try out my patch on gen2 when you replace the 8_0 with 5_0 in the 
patch. Calls look like that: "nvkm_pcie_set_link(pci, NVKM_PCIE_SPEED_8_0, 16);".
16 is for the number of lanes, but we completely ignore this value anyway 
(there are some macbooks where the binary driver actually changes the width, 
but usually it seems to be super unstable on other chips).

My GPU is connected with a "00:01.0 PCI bridge [0604]: Intel Corporation
Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16)
[8086:1901] (rev 05)"

I seriously don't think it is only an issue on a few controllers, and
might even happen with any one.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/109

------------------------------------------------------------------------
On 2018-10-04T08:44:13+00:00 karolherbst wrote:

(In reply to Maik Freudenberg from comment #105)
> Created attachment 278773 [details]
> pci driver quirk to re-train link on resume
> 
> Just for completeness, attaching a hacky driver quirk patch to re-train the
> link on resume. On my otherwise unaffected notebook, this raises the link
> speed of the gpu from 0001 (2.5 GT/s) to 0003 (8GT/s) on resume.
> Does not have any effect on the aforementioned Alienware, but at least it
> told that this device was already at the maximum reachable speed of 5GT/s.

with that quirk it still fails and I get this inside dmesg:

[  280.463651] pci_raw_set_power_state: 66 callbacks suppressed
[  280.463657] nouveau 0000:01:00.0: Refused to change power state, currently 
in D3
[  280.524318] nouveau 0000:01:00.0: nvquirk: max speed: 16
[  280.524319] nouveau 0000:01:00.0: nvquirk: current speed: 16
[  280.524320] nouveau 0000:01:00.0: nvquirk: gpu current speed: 000f
[  280.656526] nouveau 0000:01:00.0: nvquirk: 2. max speed: 16
[  280.656530] nouveau 0000:01:00.0: nvquirk: 2. current speed: 16
[  280.656536] nouveau 0000:01:00.0: nvquirk: 2. gpu current speed: 000f
[  280.656547] nouveau 0000:01:00.0: quirk_nvidia_resume+0x0/0x150 took 129123 
usecs
[  280.656590] nouveau 0000:01:00.0: Refused to change power state, currently 
in D3
[  280.656594] nouveau 0000:01:00.0: DRM: couldn't wake up GPU!

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/110

------------------------------------------------------------------------
On 2018-10-04T08:45:37+00:00 karolherbst wrote:

although you can ignore the two top lines, that is nouveau being
stupid... maybe I should remove the silly runpm code inside nouveau and
test again (it still has the pre _PR3 code in it and runs it on all
cards)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/111

------------------------------------------------------------------------
On 2018-10-04T10:52:53+00:00 karolherbst wrote:

okay, tested with most of the runpm code removed inside nouveau:

first cycle:
Oct 04 10:52:14 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: max speed: 
16
Oct 04 10:52:14 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: current 
speed: 16
Oct 04 10:52:14 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: gpu 
current speed: 0001
Oct 04 10:52:15 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: 2. max 
speed: 16
Oct 04 10:52:15 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: 2. current 
speed: 16
Oct 04 10:52:15 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: 2. gpu 
current speed: 0001
Oct 04 10:52:15 kherbst.pingu kernel: nouveau 0000:01:00.0: 
quirk_nvidia_resume+0x0/0x150 took 128846 usecs

second cycle:
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: Refused to change 
power state, currently in D3
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: max speed: 
16
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: current 
speed: 16
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: gpu 
current speed: 000f <== different
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: 2. max 
speed: 16
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: 2. current 
speed: 16
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: nvquirk: 2. gpu 
current speed: 000f
Oct 04 10:52:41 kherbst.pingu kernel: nouveau 0000:01:00.0: 
quirk_nvidia_resume+0x0/0x150 took 128599 usecs

no idea how random it is, that I get one working cycle with the quirk.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/112

------------------------------------------------------------------------
On 2018-10-04T12:42:55+00:00 hhfeuer wrote:

The data you get is even more confusing,
>gpu current speed: 000f
simply means that the gpu is turned off, the config space contains all 0xff 
which is consistent with
>Refused to change power state, currently in D3
Which raises the question, how does your patch get around that by setting the 
speed on an otherwise dead device?
>nvquirk: current speed: 16
means the bus is reporting a speed of 8GT/s, though I found this is a somehow 
cached value, I don't really know when the kernel updates that, it doesn't 
change when changing the speed of the device.
>gpu current speed: 0001
>2. gpu current speed: 0001
means that before and after training, the gpu is at 2.5GT/s so training does 
not have any effect on your device.

Like with the cached bus speed, I noticed the kernel at some point
introduced a kind of config space caching, which leads to a 'zombie
mode'. In an unrelated bug where my gpu failed to power on, with a 3.x
kernel, the config space was always 0xff which is consistent, with 4.x,
it would sometimes be 0xff and sometimes present a cached config space
while the device always being off.
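
The reading of the "gpu current speed" values above can be sketched as a small
classifier. The codes 0001 and 0003 are the ones named in this thread, 0002 is
the standard PCIe code for 5GT/s, and 000f is treated as "device not
responding" as described above. The helper names are hypothetical.

```python
# Link speed codes as used in this thread (0x2 added from the PCIe layout).
SPEED_CODES = {0x1: "2.5GT/s", 0x2: "5GT/s", 0x3: "8GT/s"}


def parse_nvquirk(line: str):
    """Extract the hex value from a 'gpu current speed: XXXX' log line."""
    marker = "gpu current speed:"
    if marker not in line:
        return None
    return line.split(marker, 1)[1].split()[0]


def classify_gpu_speed(raw: str) -> str:
    """Map the logged 4-bit speed field to a human-readable meaning."""
    value = int(raw, 16)
    if value == 0xF:
        # All ones: the config space read returned 0xff bytes.
        return "no response (all ones)"
    return SPEED_CODES.get(value, "unknown")


line = "nouveau 0000:01:00.0: nvquirk: gpu current speed: 000f <== different"
print(classify_gpu_speed(parse_nvquirk(line)))
```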

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/113

------------------------------------------------------------------------
On 2018-10-05T14:49:52+00:00 karolherbst wrote:

(In reply to Maik Freudenberg from comment #113)
> The data you get is even more confusing,
> >gpu current speed: 000f
> simply means that the gpu is turned off, the config space contains all 0xff
> which is consistent with
> >Refused to change power state, currently in D3
> Which raises the question, how does your patch get around that by setting
> the speed on an otherwise dead device?

I set it before suspending the GPU.

> >nvquirk: current speed: 16
> means the bus is reporting a speed of 8GT/s, though I found this is a
> somehow cached value, I don't really know when the kernel updates that, it
> doesn't change when changing the speed of the device.
> >gpu current speed: 0001
> >2. gpu current speed: 0001
> means that before and after training, the gpu is at 2.5GT/s so training does
> not have any effect on your device.
> 
> Like with the cached bus speed, I noticed the kernel at some point
> introduced a kind of config space caching, which leads to a 'zombie mode'.
> In an unrelated bug where my gpu failed to power on, with a 3.x kernel, the
> config space was always 0xff which is consistent, with 4.x, it would
> sometimes be 0xff and sometimes present a cached config space while the
> device always being off.

yeah... might be related. I think we kind of have to do something before
touching the GPU after invoking the ACPI methods to power it back on.
The GPU might be in some random hw state and we shouldn't touch it
before we are sure that the PCIe link is in a somewhat sane state.

I know this all works more or less reliably without Nouveau loaded (in
fact, we have to run some script from the vbios with the help of some
firmware stored in the vbios as well, both signed and verified before
execution) and if I skip that, runtime suspend/resume works as well, but
the GPU isn't in a usable state for nouveau then.

There is some memory stuff going on, but also the PCIe configuration is
touched. It might be that we need some information from Nvidia for that
(already working on that), but maybe there is a nice solution without
their help?

Anyway, we touch the PCIe configuration when loading nouveau and we
might have to change something before suspending or do something special
on resuming.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/114

------------------------------------------------------------------------
On 2018-10-07T03:44:04+00:00 hhfeuer wrote:

Dear Rafael,
Due to the known problems with the bridge
00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root 
Port #1 [8086:9d10] (rev f1)
being
https://bugzilla.kernel.org/show_bug.cgi?id=201069
https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23
I renew my question for errata sheets.
Or is the answer to whether or not that information is confidential 
confidential?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/115

------------------------------------------------------------------------
On 2018-11-05T16:30:07+00:00 mfulz wrote:

Created attachment 279327
acpidump HP Omen 15 dc0307ng

I've checked the tables for PGON and found following in ssdt7:

        Method (PGON, 1, Serialized)
        {
            PION = Arg0
            If ((PION == Zero))
            {
                If ((SGGP == Zero))
                {
                    Return (Zero)
                }
            }
            ElseIf ((PION == One))
            {
                If ((P1GP == Zero))
                {
                    Return (Zero)
                }
            }
            ElseIf ((PION == 0x02))
            {
                If ((P2GP == Zero))
                {
                    Return (Zero)
                }
            }

            PEBA = \XBAS /* External reference */
            PDEV = GDEV (PION)
            PFUN = GFUN (PION)
            Name (SCLK, Package (0x03)
            {
                One,
                0x0100,
                Zero
            })
            If ((DerefOf (SCLK [Zero]) != Zero))
            {
                PCRA (0xDC, 0x100C, ~DerefOf (SCLK [One]))
                Sleep (0x10)
            }

            If ((CCHK (PION, One) == Zero))
            {
                Return (Zero)
            }

            GPPR (PION, One)
            RTEN (PION)
            If ((PBGE != Zero))
            {
                If (SBDL (PION))
                {
                    PUAB (PION)
                    CBDL = GUBC (PION)
                    MBDL = GMXB (PION)
                    If ((CBDL > MBDL))
                    {
                        CBDL = MBDL /* \_SB_.PCI0.MBDL */
                    }

                    PDUB (PION, CBDL)
                }
            }

            \_SB.PCI0.PEG0.LREN = \_SB.PCI0.PEG0.PEGP.LTRE
            \_SB.PCI0.PEG0.CEDR = One
            While ((\_SB.PCI0.PEG0.LNKS < 0x03))
            {
                Sleep (One)
            }

            If ((PION == Zero))
            {
                S0VI = H0VI /* \_SB_.PCI0.H0VI */
                S0DI = H0DI /* \_SB_.PCI0.H0DI */
                LCT0 = ((ELC0 & 0x43) | (LCT0 & 0xFFBC))
            }

The interesting part here: \_SB.PCI0.PEG0.LNKS < 0x03
As far as I understand it should be always 0x04 as it is set only once inside 
the ssdt12:

    Scope (\_SB.PCI0.PEG0)
    {
        OperationRegion (MSID, SystemMemory, EBAS, 0x0500)
        Field (MSID, DWordAcc, Lock, Preserve)
        {
            VEID,   16,
            Offset (0x40),
            NVID,   32,
            Offset (0x4C),
            ATID,   32,
            Offset (0x88),
            PASM,   2,
            Offset (0x48B),
                ,   1,
            NHDA,   1
        }

        OperationRegion (RPCX, SystemMemory, ((\XBAS + 0x8000) + Zero), 0x1000)
        Field (RPCX, ByteAcc, NoLock, Preserve)
        {
            Offset (0x04),
            CMDR,   8,
            Offset (0x19),
            PRBN,   8,
            Offset (0x84),
            D0ST,   2,
            Offset (0xAA),
            CEDR,   1,
            Offset (0xB0),
            ASPM,   2,
                ,   2,
            LNKD,   1,
            Offset (0xC9),
                ,   2,
            LREN,   1,
            Offset (0x216),
            LNKS,   4

Here is the only place I can find LNKS initialized to 4.

I'll try to remove the while loop for a test and compile a custom dsdt
for a test.

Question: Can some acpi expert tell me what this LNKS is for?
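
A note on reading the listing, with a sketch: in an ASL Field declaration the
number after a name is the width of the field in bits, not an initial value, so
"LNKS, 4" declares a 4-bit field at byte offset 0x216 of the region. The code
below recomputes the byte and bit positions from the RPCX entries quoted
above. It also shows that, assuming \XBAS is the ECAM (memory-mapped config)
base, the "+ 0x8000" matches the standard offset for bus 0, device 1,
function 0. The helper names are made up for illustration.

```python
def ecam_offset(bus: int, dev: int, fn: int) -> int:
    """Standard ECAM offset of a function's 4 KiB config space."""
    return (bus << 20) | (dev << 15) | (fn << 12)


def field_layout(entries):
    """Compute {name: (byte_offset, first_bit, width_bits)} for a Field list.

    Entries are ("offset", byte), ("skip", bits) or ("field", name, bits).
    """
    bit = 0
    layout = {}
    for entry in entries:
        kind = entry[0]
        if kind == "offset":
            bit = entry[1] * 8
        elif kind == "skip":
            bit += entry[1]
        else:
            _, name, width = entry
            layout[name] = (bit // 8, bit % 8, width)
            bit += width
    return layout


# Entries copied from the RPCX Field quoted above.
RPCX = [
    ("offset", 0x04), ("field", "CMDR", 8),
    ("offset", 0x19), ("field", "PRBN", 8),
    ("offset", 0x84), ("field", "D0ST", 2),
    ("offset", 0xAA), ("field", "CEDR", 1),
    ("offset", 0xB0), ("field", "ASPM", 2), ("skip", 2), ("field", "LNKD", 1),
    ("offset", 0xC9), ("skip", 2), ("field", "LREN", 1),
    ("offset", 0x216), ("field", "LNKS", 4),
]

layout = field_layout(RPCX)
print(hex(ecam_offset(0, 1, 0)), layout["LNKD"], layout["LNKS"])
```

Under this reading, LNKS is whatever the hardware reports in the low four bits
of the byte at 0x216, and the While loop polls it until it reaches 0x03.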

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/116

------------------------------------------------------------------------
On 2018-11-05T22:15:29+00:00 mfulz wrote:

Seems that overloading ssdt7 is not working. Or I don't know how to do
it correctly.

Does anybody have some hints on what I could try as a workaround for this
issue?

Setting acpi_osi or rev didn't help

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/117

------------------------------------------------------------------------
On 2018-11-07T00:22:11+00:00 mfulz wrote:

Some more hints:

After reboot NVIDIA card off:
cat /proc/acpi/bbswitch
0000:01:00.0 OFF

lspci -> working

Second call to lspci -> complete freeze

Reboot:
Card off:
cat /proc/acpi/bbswitch
0000:01:00.0 OFF

optirun lspci -> working
optirun lspci -> working
optirun lspci -> working

Reboot:
Card off:
cat /proc/acpi/bbswitch
0000:01:00.0 OFF

lspci -> working

Manual cycle card power (inside acpidbg):
execute \_SB.PCI0.PGON
Evaluating \_SB.PCI0.PGON
4ACPI Warning: \_SB.PCI0.PGON: Insufficient arguments - Caller passed 0, method 
requires 1 (20180531/nsarguments-235)                                           

Evaluation of \_SB.PCI0.PGON returned object 000000003cba7e41, external buffer 
length 18
 [Integer] = 0000000000000000

execute \_SB.PCI0.PGOF
Evaluating \_SB.PCI0.PGOF
4ACPI Warning: \_SB.PCI0.PGOF: Insufficient arguments - Caller passed 0, method 
requires 1 (20180531/nsarguments-235)                                           

Evaluation of \_SB.PCI0.PGOF returned object 000000003cba7e41, external buffer 
length 18
 [Integer] = 0000000000000000

Card off:
cat /proc/acpi/bbswitch
0000:01:00.0 OFF

lspci -> working

When I cycle the card via ACPI calls before calling lspci, everything works 
fine.

To me it looks like lspci, lshw, etc. are doing something, or rather
missing something, which is done when calling the acpi methods
directly.

Does anybody know what exactly these tools are doing (or the kernel,
etc.) which  is different from the acpi calls?

Sorry if this all isn't very clear, I'm totally new to all this acpi
stuff and just trying to get around the freezes.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/118

------------------------------------------------------------------------
On 2018-11-07T00:24:33+00:00 mfulz wrote:

Sorry, I copied and pasted from the wrong terminal.
Here are the correct ACPI calls:

- execute \_SB.PCI0.PGON Zero
Evaluating \_SB.PCI0.PGON
Evaluation of \_SB.PCI0.PGON returned object 000000003cba7e41, external buffer length 18
 [Integer] = 0000000000000000

- execute \_SB.PCI0.PGOF Zero
Evaluating \_SB.PCI0.PGOF
Evaluation of \_SB.PCI0.PGOF returned object 000000003cba7e41, external buffer length 18
 [Integer] = 0000000000000000

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/119

------------------------------------------------------------------------
On 2018-11-08T00:59:19+00:00 karolherbst wrote:

(In reply to Matthias Fulz from comment #118)
> 
> Does anybody know what exactly these tools are doing (or the kernel, etc.)
> which  is different from the acpi calls?
> 

Yes, they follow a more fine-grained suspend sequence:
1. put the GPU into the D3 state via PCI config space
2. ACPI call to suspend the GPU
3. ACPI call to suspend the bus

But that's not actually what is triggering the issue. I was digging a
bit into what Nouveau does before suspending, and it turns out that
some parts of the vbios that you have to run in order to fully use the
GPU (scripts pushed through signed firmware embedded inside the vbios
on the GPU) touch the PCI link settings, which causes the resume
process to fail. More or less.

The bigger problem is that we have no way to "revert" what this
embedded script touches, and those scripts are quite large, so we need
to come up with something that works for all GPUs after the script has
been executed.

I had a small hack to change the PCIe link speed back to the boot
default which fixed the issue a little bit, but caused issues later on.
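
To observe whether the link settings differ from their boot values, a minimal Python sketch could snapshot the PCIe link attributes that sysfs exposes per device and compare two snapshots. This is not the hack mentioned above and changes nothing; it only reads `current_link_speed` and `current_link_width`. The helper names and the idea of comparing against a snapshot taken at boot are assumptions made for illustration. Reading these attributes may access the device, so on affected machines it is probably safest to take both snapshots while the card is powered on.

```python
from pathlib import Path

# sysfs attributes describing the negotiated PCIe link of a device.
LINK_ATTRS = ("current_link_speed", "current_link_width")


def read_link_state(device_dir):
    """Return a dict of link attributes for one PCI device directory.

    device_dir is a sysfs path such as /sys/bus/pci/devices/0000:01:00.0
    (the address used in this thread). Missing attributes map to None.
    """
    base = Path(device_dir)
    state = {}
    for name in LINK_ATTRS:
        attr = base / name
        state[name] = attr.read_text().strip() if attr.exists() else None
    return state


def changed_attributes(before, after):
    """List the attribute names whose values differ between two snapshots."""
    return [name for name in LINK_ATTRS if before[name] != after[name]]
```

A snapshot taken right after boot can be compared with one taken after the GPU has been used; a non-empty result from `changed_attributes` would suggest the link settings were touched in between.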

I am currently in touch with Nvidia about that issue and we might get
some more information on that in the future, which would help us to fix
this issue completely.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/120

------------------------------------------------------------------------
On 2018-11-08T08:33:56+00:00 mfulz wrote:

Thanks for the info.

In case it helps: my setup runs the NVIDIA binary driver with Bumblebee.
I saw your patch, but because I have nouveau blacklisted I can't try it out.

Yesterday I ran into the issue and got error messages for these two ACPI
methods, in case this is of any help:

\_SB.PCI0.PEG0.PEGP._ON, AE_AML_LOOP_TIMEOUT
\_SB.PCI0.PEG0.PEGP._PS0, AE_AML_LOOP_TIMEOUT
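
When collecting such reports from several machines, a small Python sketch like the following can pull the method path and status code out of lines in the "path, status" form quoted above. The regular expression and the function name are assumptions for illustration; real kernel log lines carry extra text around these two fields, which the pattern simply skips.

```python
import re

# An ACPI namespace path starting with a backslash, then a comma,
# then an ACPICA status code such as AE_AML_LOOP_TIMEOUT.
_FAILURE = re.compile(r"(\\[A-Za-z0-9_.]+),\s*(AE_[A-Z_]+)")


def parse_acpi_failures(text):
    """Return (method_path, status) tuples found anywhere in text."""
    return _FAILURE.findall(text)


def methods_with_status(text, status):
    """Return the method paths that failed with the given status, in order."""
    return [path for path, code in parse_acpi_failures(text) if code == status]
```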

Will try to debug them later at home a bit.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/121

------------------------------------------------------------------------
On 2018-11-12T06:28:10+00:00 houkime wrote:

+1 affected machine to the piggybank of knowledge

Asus Vivobook Pro N580GD-FI110
Intel Core i5,
Nvidia GTX1050M

Symptoms:
* Fails to reboot, shut down or suspend-resume.
* lspci sometimes completes after some time, but leaves the touchpad and/or keyboard non-functional
* systemd output on reboot shows a bunch of processes being endlessly caught in soft and sometimes hard lockups

Information collection is limited because it is my friend's laptop which
he currently uses and not mine.

For now the problem was solved by installing the proprietary nvidia driver, then
removing xf86-video-nouveau (Arch-based system) and blacklisting the nouveau kernel module.
(Not sure it is ethical to persuade him to go back to nouveau to do more tests)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/122

------------------------------------------------------------------------
On 2018-11-23T00:05:22+00:00 mfulz wrote:

One more piece of information:
I'm trying nouveau with nvidia PRIME and there are no lockups with this setup.

The issue I have now is that power consumption with discrete-card
render offload is very high: around 15W, compared to 7W when the card is
switched off via ACPI.

cat /sys/kernel/debug/vgaswitcheroo/switch
0:DIS: :DynOff:0000:01:00.0
1:IGD:+:Pwr:0000:00:02.0

xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x8d cap: 0xb, Source Output, Sink Output, Sink Offload crtcs: 4 outputs: 4 associated providers: 1 name:Intel
Provider 1: id: 0x65 cap: 0x7, Source Output, Sink Output, Source Offload crtcs: 4 outputs: 2 associated providers: 1 name:nouveau

I can switch off the card completely via acpi_call but then the lockups
are popping up again...
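
For scripting checks against the vga_switcheroo switch output shown above, a minimal Python sketch could split each line into its colon-separated fields. The field names below are assumptions chosen for readability; reading "+" as the active client and "Dyn" as driver-managed runtime power control is my understanding of the format, not something stated in this thread.

```python
from typing import NamedTuple


class SwitchClient(NamedTuple):
    index: int
    kind: str         # "DIS" (discrete) or "IGD" (integrated)
    active: bool      # True when the third field is "+"
    power: str        # e.g. "Pwr", "Off" or "DynOff"
    pci_address: str  # e.g. "0000:01:00.0"


def parse_switch(text):
    """Parse the contents of /sys/kernel/debug/vgaswitcheroo/switch."""
    clients = []
    for line in text.splitlines():
        line = line.strip()
        # The PCI address itself contains colons, so split at most 4 times.
        parts = line.split(":", 4)
        if len(parts) != 5 or not parts[0].isdigit():
            continue  # skip blank or unrecognised lines
        index, kind, active, power, address = parts
        clients.append(
            SwitchClient(int(index), kind, active == "+", power, address)
        )
    return clients
```

With the output above, the discrete card would be reported as inactive with power state "DynOff", and the integrated GPU as active with "Pwr".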

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/127

------------------------------------------------------------------------
On 2018-11-23T08:34:58+00:00 AlexanderMangaard wrote:

Created attachment 279627
attachment-17158-0.html

Has anyone else noticed that whilst GRUB boot parameters seem to resolve
the problem (perhaps with some decrease in performance) the system will
once again freeze upon resuming from a suspend (closing laptop lid)?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/128

------------------------------------------------------------------------
On 2018-11-23T21:07:43+00:00 mfulz wrote:

(In reply to Alexander from comment #124)
> Created attachment 279627 [details]
> attachment-17158-0.html
> 
> Has anyone else noticed that whilst GRUB boot parameters seem to resolve
> the problem (perhaps with some decrease in performance) the system will
> once again freeze upon resuming from a suspend (closing laptop lid)?
> 

I can confirm this issue.
Suspending, even with working acpi_osi settings, still causes a freeze on resume.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803179/comments/130

** Changed in: linux
       Status: Unknown => Incomplete

** Changed in: linux
   Importance: Unknown => Medium

** Bug watch added: github.com/Bumblebee-Project/Bumblebee/issues #764
   https://github.com/Bumblebee-Project/Bumblebee/issues/764

** Bug watch added: github.com/Bumblebee-Project/bbswitch/issues #134
   https://github.com/Bumblebee-Project/bbswitch/issues/134

** Bug watch added: bugzilla.opensuse.org/ #1022443
   https://bugzilla.opensuse.org/show_bug.cgi?id=1022443

** Bug watch added: Linux Kernel Bug Tracker #194431
   https://bugzilla.kernel.org/show_bug.cgi?id=194431

** Bug watch added: github.com/Bumblebee-Project/bbswitch/issues #142
   https://github.com/Bumblebee-Project/bbswitch/issues/142

** Bug watch added: github.com/Bumblebee-Project/bbswitch/issues #137
   https://github.com/Bumblebee-Project/bbswitch/issues/137

** Bug watch added: github.com/Bumblebee-Project/bbswitch/issues #148
   https://github.com/Bumblebee-Project/bbswitch/issues/148

** Bug watch added: bugs.acpica.org/ #1333
   https://bugs.acpica.org/show_bug.cgi?id=1333

** Bug watch added: freedesktop.org Bugzilla #100423
   https://bugs.freedesktop.org/show_bug.cgi?id=100423

** Bug watch added: Linux Kernel Bug Tracker #201069
   https://bugzilla.kernel.org/show_bug.cgi?id=201069

** Bug watch added: Linux Kernel Bug Tracker #116851
   https://bugzilla.kernel.org/show_bug.cgi?id=116851

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1803179

Title:
  System does not reliably come out of suspend

To manage notifications about this bug go to:
https://bugs.launchpad.net/linux/+bug/1803179/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
