Re: [Nouveau] [PATCH v5 02/31] drm/i915: Don't register backlight when another backlight should be used (v2)

2022-10-27 Thread Rafael J. Wysocki
On Thu, Oct 27, 2022 at 2:17 PM Hans de Goede  wrote:
>
> Hi,
>
> On 10/27/22 14:09, Rafael J. Wysocki wrote:
> > On Thu, Oct 27, 2022 at 12:37 PM Hans de Goede  wrote:
> >>
> >> Hi,
> >>
> >> On 10/27/22 11:52, Matthew Garrett wrote:
> >>> On Thu, Oct 27, 2022 at 11:39:38AM +0200, Hans de Goede wrote:
> >>>
> >>>> The *only* behavior which actually is new in 6.1 is the native GPU
> >>>> drivers now doing the equivalent of:
> >>>>
> >>>>  if (acpi_video_get_backlight_type() != acpi_backlight_native)
> >>>>  return;
> >>>>
> >>>> In their backlight register paths (i), which is causing the native
> >>>> backlight to disappear on your custom laptop setup and on Chromebooks
> >>>> (with the Chromebooks case being already solved I hope.).
> >>>
> >>> It's causing the backlight control to vanish on any machine that isn't
> >>> ((acpi_video || vendor interface) || !acpi). Most machines that fall
> >>> into that are either weird or Chromebooks or old, but there are machines
> >>> that fall into that.
> >>
> >> I acknowledge that there are machines that fall into this category,
> >> but I expect / hope there to be so few of them that we can just DMI
> >> quirk our way out of this.
> >>
> >> I believe the old group to be small because:
> >>
> >> 1. Generally speaking the "native" control method is usually not
> >> present on the really old (pre ACPI video spec) mobile GPUs.
> >>
> >> 2. On most old laptops I would still expect there to be a vendor
> >> interface too, and if both get registered standard desktop environments
> >> will prefer the vendor one, so then we need a native DMI quirk to
> >> disable the vendor interface anyways and we already have a bunch of
> >> those, so some laptops in this group are already covered by DMI quirks.
> >>
> >> And a fix for the Chromebook case is already in Linus' tree, which
> >> just leaves the weird case, of which there will hopefully be only
> >> a few.
> >>
> >> I do share your worry that this might break some machines, but
> >> the only way to really find out is to get this code out there
> >> I'm afraid.
> >>
> >> I have just written a blog post asking for people to check if
> >> their laptop might be affected; and to report various details
> >> to me if their laptop is affected:
> >>
> >> https://hansdegoede.dreamwidth.org/26548.html
> >>
> >> Let's wait and see how this goes. If I get (too) many reports then
> >> I will send a revert of the addition of the:
> >>
> >> if (acpi_video_get_backlight_type() != acpi_backlight_native)
> >> return;
> >>
> >> check to the i915 / radeon / amd / nouveau drivers.
> >>
> >> (And if I only get a couple of reports I will probably just submit
> >> DMI quirks for the affected models).
> >
> > Sounds reasonable to me, FWIW.
> >
> > And IIUC the check above can be overridden by passing
> > acpi_backlight=native in the kernel command line, right?
>
> Right, that can be used as a quick workaround, but we really do
> want this to work OOTB everywhere.

Sure.

My point is that if it doesn't work OOTB for someone, and say it used
to, they can use the above workaround and report back.
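
For illustration, the guard being discussed looks roughly like this in a native GPU driver's backlight registration path (a minimal C sketch, not the exact i915/nouveau code; the function name is made up):

#include <acpi/video.h>

static void example_register_native_backlight(void)
{
	/*
	 * Defer to the acpi_video or vendor interface when the heuristics
	 * in acpi_video_get_backlight_type() say one of those should own
	 * brightness control.  Booting with acpi_backlight=native forces
	 * the returned type to acpi_backlight_native, which is why it
	 * works as a quick workaround.
	 */
	if (acpi_video_get_backlight_type() != acpi_backlight_native)
		return;

	/* ... register the /sys/class/backlight device as before ... */
}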


Re: [Nouveau] [PATCH v5 02/31] drm/i915: Don't register backlight when another backlight should be used (v2)

2022-10-27 Thread Rafael J. Wysocki
On Thu, Oct 27, 2022 at 12:37 PM Hans de Goede  wrote:
>
> Hi,
>
> On 10/27/22 11:52, Matthew Garrett wrote:
> > On Thu, Oct 27, 2022 at 11:39:38AM +0200, Hans de Goede wrote:
> >
> >> The *only* behavior which actually is new in 6.1 is the native GPU
> >> drivers now doing the equivalent of:
> >>
> >>  if (acpi_video_get_backlight_type() != acpi_backlight_native)
> >>  return;
> >>
> >> In their backlight register paths (i), which is causing the native
> >> backlight to disappear on your custom laptop setup and on Chromebooks
> >> (with the Chromebooks case being already solved I hope.).
> >
> > It's causing the backlight control to vanish on any machine that isn't
> > ((acpi_video || vendor interface) || !acpi). Most machines that fall
> > into that are either weird or Chromebooks or old, but there are machines
> > that fall into that.
>
> I acknowledge that there are machines that fall into this category,
> but I expect / hope there to be so few of them that we can just DMI
> quirk our way out of this.
>
> I believe the old group to be small because:
>
> 1. Generally speaking the "native" control method is usually not
> present on the really old (pre ACPI video spec) mobile GPUs.
>
> 2. On most old laptops I would still expect there to be a vendor
> interface too, and if both get registered standard desktop environments
> will prefer the vendor one, so then we need a native DMI quirk to
> disable the vendor interface anyways and we already have a bunch of
> those, so some laptops in this group are already covered by DMI quirks.
>
> And a fix for the Chromebook case is already in Linus' tree, which
> just leaves the weird case, of which there will hopefully be only
> a few.
>
> I do share your worry that this might break some machines, but
> the only way to really find out is to get this code out there
> I'm afraid.
>
> I have just written a blog post asking for people to check if
> their laptop might be affected; and to report various details
> to me if their laptop is affected:
>
> https://hansdegoede.dreamwidth.org/26548.html
>
> Let's wait and see how this goes. If I get (too) many reports then
> I will send a revert of the addition of the:
>
> if (acpi_video_get_backlight_type() != acpi_backlight_native)
> return;
>
> check to the i915 / radeon / amd / nouveau drivers.
>
> (And if I only get a couple of reports I will probably just submit
> DMI quirks for the affected models).

Sounds reasonable to me, FWIW.

And IIUC the check above can be overridden by passing
acpi_backlight=native in the kernel command line, right?
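
The DMI quirks mentioned above are entries in video_detect_dmi_table[] in drivers/acpi/video_detect.c. A sketch of what an entry forcing the native interface for an affected model could look like (the vendor and model strings are placeholders, not a real quirk):

	{
	 .callback = video_detect_force_native,
	 /* Hypothetical affected laptop */
	 .matches = {
		DMI_MATCH(DMI_SYS_VENDOR, "Example Vendor"),
		DMI_MATCH(DMI_PRODUCT_NAME, "Example Model"),
		},
	},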


Re: [Nouveau] [PATCH 1/4] ACPI: OSI: Remove Linux-Dell-Video _OSI string

2022-08-25 Thread Rafael J. Wysocki
On Wed, Aug 24, 2022 at 8:28 PM Limonciello, Mario
 wrote:
>
> [Public]
>
>
>
> > -Original Message-
> > From: Kai-Heng Feng 
> > Sent: Wednesday, August 24, 2022 09:17
> > To: Limonciello, Mario 
> > Cc: raf...@kernel.org; Len Brown ;
> > nouveau@lists.freedesktop.org; hdego...@redhat.com;
> > dda...@nvidia.com; dell.client.ker...@dell.com; kher...@redhat.com;
> > Lyude Paul ; linux-a...@vger.kernel.org; linux-
> > ker...@vger.kernel.org
> > Subject: Re: [PATCH 1/4] ACPI: OSI: Remove Linux-Dell-Video _OSI string
> >
> > On Wed, Aug 24, 2022 at 2:51 AM Mario Limonciello
> >  wrote:
> > >
> > > This string was introduced because drivers for NVIDIA hardware
> > > had bugs supporting RTD3 in the past.  Those bugs have been fixed
> > > by commit 5775b843a619 ("PCI: Restore config space on runtime resume
> > > despite being unbound"). so vendors shouldn't be using this string
> > > to modify ASL anymore.
> >
> > Adding some background on what happened:
> >
> > Before the proprietary NVIDIA driver supported RTD3, Ubuntu had a mechanism
> > that could switch PRIME on and off, though it required a logout/login
> > to make the library switch happen.
> > When PRIME was off, the mechanism unloaded the NVIDIA driver and put the
> > device into D3cold, but the GPU never came back to D0 again. So ODMs used the
> > _OSI string to expose an old _DSM method to switch the power on/off.
> >
> > The issue is fixed by the said commit so we can discard the workaround now.
> >
>
> Thanks for that.  If this series needs to spin I'll roll that into the commit 
> message.
> Otherwise perhaps Rafael can pick up some of it if he thinks it makes sense 
> to include.

I've applied the series (as 6.1 material) and included the information
above in the changelog of the first patch.  I've also edited the
changelogs of the other patches somewhat.

Thanks!


Re: [Nouveau] [PATCH v2 00/29] drm/kms: Stop registering multiple /sys/class/backlight devs for a single display

2022-07-14 Thread Rafael J. Wysocki
On Tue, Jul 12, 2022 at 9:39 PM Hans de Goede  wrote:
>
> Hi All,
>
> As mentioned in my RFC titled "drm/kms: control display brightness through
> drm_connector properties":
> https://lore.kernel.org/dri-devel/0d188965-d809-81b5-74ce-7d30c49fe...@redhat.com/
>
> The first step towards this is to deal with some existing technical debt
> in backlight handling on x86/ACPI boards, specifically we need to stop
> registering multiple /sys/class/backlight devs for a single display.
>
> This series implements my RFC describing my plan for these cleanups:
> https://lore.kernel.org/dri-devel/98519ba0-7f18-201a-ea34-652f50343...@redhat.com/
>
> This new version addresses the few small remarks made on version 1 (mainly
> changing patch 1/29) and more importantly this finishes the refactoring by
> also addressing all the bits from the "Other issues" section of
> the refactor RFC (resulting in patches 15-29 which are new in v2).
>
> Please review and test! I hope to be able to make an immutable branch
> based on 5.20-rc1 + this series available for merging into the various
> touched subsystems once 5.20-rc2 is out.

Please feel free to add

Acked-by: Rafael J. Wysocki 

to all of the ACPI video patches in this series.

Thanks!

> Hans de Goede (29):
>   ACPI: video: Add acpi_video_backlight_use_native() helper
>   drm/i915: Don't register backlight when another backlight should be
> used
>   drm/amdgpu: Don't register backlight when another backlight should be
> used
>   drm/radeon: Don't register backlight when another backlight should be
> used
>   drm/nouveau: Don't register backlight when another backlight should be
> used
>   ACPI: video: Drop backlight_device_get_by_type() call from
> acpi_video_get_backlight_type()
>   ACPI: video: Remove acpi_video_bus from list before tearing it down
>   ACPI: video: Simplify acpi_video_unregister_backlight()
>   ACPI: video: Make backlight class device registration a separate step
>   ACPI: video: Remove code to unregister acpi_video backlight when a
> native backlight registers
>   drm/i915: Call acpi_video_register_backlight() (v2)
>   drm/nouveau: Register ACPI video backlight when nv_backlight
> registration fails
>   drm/amdgpu: Register ACPI video backlight when skipping amdgpu
> backlight registration
>   drm/radeon: Register ACPI video backlight when skipping radeon
> backlight registration
>   ACPI: video: Refactor acpi_video_get_backlight_type() a bit
>   ACPI: video: Add Nvidia WMI EC brightness control detection
>   ACPI: video: Add Apple GMUX brightness control detection
>   platform/x86: apple-gmux: Stop calling acpi/video.h functions
>   platform/x86: toshiba_acpi: Stop using
> acpi_video_set_dmi_backlight_type()
>   platform/x86: acer-wmi: Move backlight DMI quirks to
> acpi/video_detect.c
>   platform/x86: asus-wmi: Drop DMI chassis-type check from backlight
> handling
>   platform/x86: asus-wmi: Move acpi_backlight=vendor quirks to ACPI
> video_detect.c
>   platform/x86: asus-wmi: Move acpi_backlight=native quirks to ACPI
> video_detect.c
>   platform/x86: samsung-laptop: Move acpi_backlight=[vendor|native]
> quirks to ACPI video_detect.c
>   ACPI: video: Remove acpi_video_set_dmi_backlight_type()
>   ACPI: video: Drop "Samsung X360" acpi_backlight=native quirk
>   ACPI: video: Drop Clevo/TUXEDO NL5xRU and NL5xNU acpi_backlight=native
> quirks
>   ACPI: video: Fix indentation of video_detect_dmi_table[] entries
>   drm/todo: Add entry about dealing with brightness control on devices
> with > 1 panel
>
>  Documentation/gpu/todo.rst|  68 +++
>  drivers/acpi/Kconfig  |   1 +
>  drivers/acpi/acpi_video.c |  59 ++-
>  drivers/acpi/video_detect.c   | 415 +++---
>  drivers/gpu/drm/Kconfig   |  12 +
>  .../gpu/drm/amd/amdgpu/atombios_encoders.c|  14 +-
>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |   9 +
>  drivers/gpu/drm/gma500/Kconfig|   2 +
>  drivers/gpu/drm/i915/Kconfig  |   2 +
>  .../gpu/drm/i915/display/intel_backlight.c|   7 +
>  drivers/gpu/drm/i915/display/intel_display.c  |   8 +
>  drivers/gpu/drm/i915/display/intel_panel.c|   3 +
>  drivers/gpu/drm/i915/i915_drv.h   |   2 +
>  drivers/gpu/drm/nouveau/nouveau_backlight.c   |  14 +
>  drivers/gpu/drm/radeon/atombios_encoders.c|   7 +
>  drivers/gpu/drm/radeon/radeon_encoders.c  |  11 +-
>  .../gpu/drm/radeon/radeon_legacy_encoders.c   |   7 +
>  drivers/platform/x86/acer-wmi.c   |  66 ---
>  drivers/platform/x86/apple-gmux.c |   3 -
>  dr

Re: [Nouveau] [PATCH][next] treewide: Replace zero-length arrays with flexible-array members

2022-02-20 Thread Rafael J. Wysocki
On Tue, Feb 15, 2022 at 8:24 PM Gustavo A. R. Silva
 wrote:
>
> On Tue, Feb 15, 2022 at 09:19:29PM +0200, Leon Romanovsky wrote:
> > On Tue, Feb 15, 2022 at 01:21:10PM -0600, Gustavo A. R. Silva wrote:
> > > On Tue, Feb 15, 2022 at 10:17:40AM -0800, Kees Cook wrote:
> > > > On Tue, Feb 15, 2022 at 11:47:43AM -0600, Gustavo A. R. Silva wrote:
> > > >
> > > > These all look trivially correct to me. Only two didn't have the end of
> > > > the struct visible in the patch, and checking those showed them to be
> > > > trailing members as well, so:
> > > >
> > > > Reviewed-by: Kees Cook 
> > >
> > > I'll add this to my -next tree.
> >
> > I would like to ask you to send mlx5 patch separately to netdev. We are 
> > working
> > to delete that file completely and prefer to avoid from unnecessary merge 
> > conflicts.
>
> Oh OK. Sure thing; I will do so.

Can you also send the ACPI patch separately, please?

We would like to route it through the upstream ACPICA code base.


[Nouveau] [PATCH v2 2/7] nouveau: ACPI: Use the ACPI_COMPANION() macro directly

2021-10-13 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

The ACPI_HANDLE() macro is a wrapper around the ACPI_COMPANION()
macro and the ACPI handle produced by the former comes from the
ACPI device object produced by the latter, so it is way more
straightforward to evaluate the latter directly instead of passing
the handle produced by the former to acpi_bus_get_device().

Modify nouveau_acpi_edid() accordingly (no intentional functional
impact).

Signed-off-by: Rafael J. Wysocki 
Reviewed-by: Ben Skeggs 
---

v1 -> v2:
   * Resend with a different From and S-o-b address and with R-by from Ben.
 No other changes.

---
 drivers/gpu/drm/nouveau/nouveau_acpi.c |9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

Index: linux-pm/drivers/gpu/drm/nouveau/nouveau_acpi.c
===
--- linux-pm.orig/drivers/gpu/drm/nouveau/nouveau_acpi.c
+++ linux-pm/drivers/gpu/drm/nouveau/nouveau_acpi.c
@@ -364,7 +364,6 @@ void *
 nouveau_acpi_edid(struct drm_device *dev, struct drm_connector *connector)
 {
struct acpi_device *acpidev;
-   acpi_handle handle;
int type, ret;
void *edid;
 
@@ -377,12 +376,8 @@ nouveau_acpi_edid(struct drm_device *dev
return NULL;
}
 
-   handle = ACPI_HANDLE(dev->dev);
-   if (!handle)
-   return NULL;
-
-   ret = acpi_bus_get_device(handle, &acpidev);
-   if (ret)
+   acpidev = ACPI_COMPANION(dev->dev);
+   if (!acpidev)
return NULL;
 
ret = acpi_video_get_edid(acpidev, type, -1, &edid);





[Nouveau] [PATCH v1 2/7] nouveau: ACPI: Use the ACPI_COMPANION() macro directly

2021-10-12 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

The ACPI_HANDLE() macro is a wrapper around the ACPI_COMPANION()
macro and the ACPI handle produced by the former comes from the
ACPI device object produced by the latter, so it is way more
straightforward to evaluate the latter directly instead of passing
the handle produced by the former to acpi_bus_get_device().

Modify nouveau_acpi_edid() accordingly (no intentional functional
impact).

Signed-off-by: Rafael J. Wysocki 
---
 drivers/gpu/drm/nouveau/nouveau_acpi.c |9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

Index: linux-pm/drivers/gpu/drm/nouveau/nouveau_acpi.c
===
--- linux-pm.orig/drivers/gpu/drm/nouveau/nouveau_acpi.c
+++ linux-pm/drivers/gpu/drm/nouveau/nouveau_acpi.c
@@ -364,7 +364,6 @@ void *
 nouveau_acpi_edid(struct drm_device *dev, struct drm_connector *connector)
 {
struct acpi_device *acpidev;
-   acpi_handle handle;
int type, ret;
void *edid;
 
@@ -377,12 +376,8 @@ nouveau_acpi_edid(struct drm_device *dev
return NULL;
}
 
-   handle = ACPI_HANDLE(dev->dev);
-   if (!handle)
-   return NULL;
-
-   ret = acpi_bus_get_device(handle, &acpidev);
-   if (ret)
+   acpidev = ACPI_COMPANION(dev->dev);
+   if (!acpidev)
return NULL;
 
ret = acpi_video_get_edid(acpidev, type, -1, &edid);





Re: [Nouveau] [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-23 Thread Rafael J. Wysocki
On Mon, Nov 23, 2020 at 4:58 PM James Bottomley
 wrote:
>
> On Mon, 2020-11-23 at 15:19 +0100, Miguel Ojeda wrote:
> > On Sun, Nov 22, 2020 at 11:36 PM James Bottomley
> >  wrote:

[cut]

> >
> > Maintainers routinely review 1-line trivial patches, not to mention
> > internal API changes, etc.
>
> We're also complaining about the inability to recruit maintainers:
>
> https://www.theregister.com/2020/06/30/hard_to_find_linux_maintainers_says_torvalds/
>
> And burn out:
>
> http://antirez.com/news/129

Right.

> The whole crux of your argument seems to be maintainers' time isn't
> important so we should accept all trivial patches ... I'm pushing back
> on that assumption in two places, firstly the valulessness of the time
> and secondly that all trivial patches are valuable.
>
> > If some company does not want to pay for that, that's fine, but they
> > don't get to be maintainers and claim `Supported`.
>
> What I'm actually trying to articulate is a way of measuring value of
> the patch vs cost ... it has nothing really to do with who foots the
> actual bill.
>
> One thesis I'm actually starting to formulate is that this continual
> devaluing of maintainers is why we have so much difficulty keeping and
> recruiting them.

Absolutely.

This is just one of the factors involved, but a significant one IMV.


Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-12-09 Thread Rafael J. Wysocki
On Mon, Dec 9, 2019 at 12:17 PM Karol Herbst  wrote:
>
> anybody any other ideas?

Not yet, but I'm trying to collect some more information.

> It seems that both patches don't really fix
> the issue and I have no idea left on my side to try out. The only
> thing left I could do to further investigate would be to reverse
> engineer the Nvidia driver as they support runpm on Turing+ GPUs now,
> but I've heard users having similar issues to the one Lyude told us
> about... and I couldn't verify that the patches help there either in a
> reliable way.

It looks like the newer (8+) versions of Windows expect the GPU driver
to prepare the GPU for power removal in some specific way and the
latter fails if the GPU has not been prepared as expected.

Because testing indicates that the Windows 7 path in the platform
firmware works, it may be worth trying to do what it does to the PCIe
link before invoking the _OFF method for the power resource
controlling the GPU power.

If Mika's theory that the Win7 path simply turns the PCIe link off
is correct, then whatever the _OFF method tries to do to the link
after that should not matter.
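
A minimal kernel-side sketch of that experiment, assuming "bridge" is the PCIe root port above the GPU (illustration only, not a proposed fix): set the Link Disable bit before the power resource's _OFF runs, which is roughly what the Windows 7 firmware path appears to do.

#include <linux/pci.h>

static void example_disable_link_before_off(struct pci_dev *bridge)
{
	/* Force the link down via the Link Disable bit in Link Control. */
	pcie_capability_set_word(bridge, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LD);
}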

> On Wed, Nov 27, 2019 at 8:55 PM Lyude Paul  wrote:
> >
> > On Wed, 2019-11-27 at 12:51 +0100, Karol Herbst wrote:
> > > On Wed, Nov 27, 2019 at 12:49 PM Mika Westerberg
> > >  wrote:
> > > > On Tue, Nov 26, 2019 at 06:10:36PM -0500, Lyude Paul wrote:
> > > > > Hey-this is almost certainly not the right place in this thread to
> > > > > respond,
> > > > > but this thread has gotten so deep evolution can't push the subject
> > > > > further to
> > > > > the right, heh. So I'll just respond here.
> > > >
> > > > :)
> > > >
> > > > > I've been following this and helping out Karol with testing here and
> > > > > there.
> > > > > They had me test Bjorn's PCI branch on the X1 Extreme 2nd generation,
> > > > > which
> > > > > has a turing GPU and 8086:1901 PCI bridge.
> > > > >
> > > > > I was about to say "the patch fixed things, hooray!" but it seems that
> > > > > after
> > > > > trying runtime suspend/resume a couple times things fall apart again:
> > > >
> > > > You mean $subject patch, no?
> > > >
> > >
> > > no, I told Lyude to test the pci/pm branch as the runpm errors we saw
> > > on that machine looked different. Some BAR error the GPU reported
> > > after it got resumed, so I was wondering if the delays were helping
> > > with that. But after some cycles it still caused the same issue, that
> > > the GPU disappeared. Later testing also showed that my patch also
> > > didn't seem to help with this error sadly :/
> > >
> > > > > [  686.883247] nouveau 0000:01:00.0: DRM: suspending object tree...
> > > > > [  752.866484] ACPI Error: Aborting method \_SB.PCI0.PEG0.PEGP.NVPO 
> > > > > due
> > > > > to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
> > > > > [  752.866508] ACPI Error: Aborting method \_SB.PCI0.PGON due to
> > > > > previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
> > > > > [  752.866521] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due
> > > > > to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
> > > >
> > > > This is probably the culprit. The same AML code fails to properly turn
> > > > on the device.
> > > >
> > > > Is acpidump from this system available somewhere?
> >
> > Attached it to this email
> >
> > > >
> > --
> > Cheers,
> > Lyude Paul
>

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-22 Thread Rafael J. Wysocki
On Fri, Nov 22, 2019 at 12:52 PM Mika Westerberg
 wrote:
>

[cut]

[I'm really running out of time for today, unfortunately.]

> > > > The current design is mostly based on the PCI PM Spec 1.2, so it would
> > > > be consequent to follow system-wide suspend in PM-runtime and avoid
> > > > putting PCIe ports holding devices in D0 into any low-power states.
> > > > but that would make the approach in the $subject patch ineffective.
> > > >
> > > > Moreover, the fact that there are separate branches for "Windows 7"
> > > > and "Windows 8+" kind of suggest a change in the expected behavior
> > > > between Windows 7 and Windows 8, from the AML perspective.  I would
> > > > guess that Windows 7 followed PCI PM 1.2 and Windows 8 (and later)
> > > > does something else.
> > >
> > > My understanding (which may not be correct) is that up to Windows 7 it
> > > never put the devices into D3cold runtime. Only when the system entered
> > > Sx states it evaluated the _OFF methods.
> >
> > I see.

I think I have misunderstood what you said.

I also think that Windows 7 and before didn't do RTD3, but it did PCI
PM nevertheless and platform firmware could expect it to behave in a
specific way in that respect.  That expected behavior seems to have
changed in the next generations of Windows, as reflected by the OS
version and _REV checks in ASL.

> > > Starting from Windows 8 it started doing this runtime so devices can
> > > enter D3cold even when system is in S0.
> >
> > Hmm.  So by setting _REV to 5 we effectively change the _OFF into a NOP?
>
> No, there are two paths in the _OFF() and them some common code such as
> removing power etc.
>
> What the _REV 5 did is that it went into path where the AML seemed to
> directly disable the link.
>
> The other path that is taken with Windows 8+ does not disable the link
> but instead it puts it to low power L2 or L3 state (I suppose L3 since
> it removes the power and the GPU probably does not support wake).

OK, so the very existence of the two paths means that the OS behavior
expected by the firmware in the two cases represented by them is
different.  Presumably, the expected hardware configuration in which
the AML runs also is different in these two cases.

> The ASL code is below. PGOF() gets called from the power resource
> _OFF():

I'll look at it in detail when I have some more time later.

> Method (PGOF, 1, Serialized)
> {
> PIOF = Arg0
> If ((PIOF == Zero))
> {
> If ((SGGP == Zero))
> {
> Return (Zero)
> }
> }
> ElseIf ((PIOF == One))
> {
> If ((P1GP == Zero))
> {
> Return (Zero)
> }
> }
> ElseIf ((PIOF == 0x02))
> {
> If ((P2GP == Zero))
> {
> Return (Zero)
> }
> }
>
> PEBA = \XBAS /* External reference */
> PDEV = GDEV (PIOF)
> PFUN = GFUN (PIOF)
> Name (SCLK, Package (0x03)
> {
> One,
> 0x80,
> Zero
> })
> If ((CCHK (PIOF, Zero) == Zero))
> {
> Return (Zero)
> }
>
> \_SB.PCI0.PEG0.PEGP.LTRE = \_SB.PCI0.PEG0.LREN
> If ((Arg0 == Zero))
> {
> ELC0 = LCT0 /* \_SB_.PCI0.LCT0 */
> H0VI = S0VI /* \_SB_.PCI0.S0VI */
> H0DI = S0DI /* \_SB_.PCI0.S0DI */
> ECP0 = LCP0 /* \_SB_.PCI0.LCP0 */
> }
> ElseIf ((Arg0 == One))
> {
> ELC1 = LCT1 /* \_SB_.PCI0.LCT1 */
> H1VI = S1VI /* \_SB_.PCI0.S1VI */
> H1DI = S1DI /* \_SB_.PCI0.S1DI */
> ECP1 = LCP1 /* \_SB_.PCI0.LCP1 */
> }
> ElseIf ((Arg0 == 0x02))
> {
> ELC2 = LCT2 /* \_SB_.PCI0.LCT2 */
> H2VI = S2VI /* \_SB_.PCI0.S2VI */
> H2DI = S2DI /* \_SB_.PCI0.S2DI */
> ECP2 = LCP2 /* \_SB_.PCI0.LCP2 */
> }
>
> If (((OSYS <= 0x07D9) || ((OSYS == 0x07DF) && (_REV == 0x05))))
> {
> If ((PIOF == Zero))
> {
> P0LD = One
> TCNT = Zero
> While ((TCNT < LDLY))
> {
> If ((P0LT == 0x08))
> {
> Break
> }
>
> Sleep (0x10)
> TCNT += 0x10
> }
>
> P0RM = One
> P0AP = 0x03
> }
> ElseIf ((PIOF == One))
> {
> P1LD = One
> 

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-22 Thread Rafael J. Wysocki
On Fri, Nov 22, 2019 at 12:34 PM Karol Herbst  wrote:
>
> On Fri, Nov 22, 2019 at 12:30 PM Rafael J. Wysocki  wrote:
> >

[cut]

> >
>
> the issue is not AML related at all as I am able to reproduce this
> issue without having to invoke any of that at all, I just need to poke
> into the PCI register directly to cut the power.

Since the register is not documented, you don't actually know what
exactly happens when it is written to.

You basically are saying something like "if I write a specific value
to an undocumented register, that makes things fail".  And yes,
writing things to undocumented registers is likely to cause failure to
happen, in general.

The point is that the kernel will never write into this register by itself.

> The register is not documented, but it is effectively what the AML code is
> writing to as well.

So that AML code is problematic.  It expects the write to do something
useful, but that's not the case.  Without the AML, the register would
not have been written to at all.

> Of course it might also be that the code I was testing it was doing
> things in a non conformant way and I just hit a different issue as
> well, but in the end I don't think that the AML code is the root cause
> of all of that.

If AML is not involved at all, things work.  You've just said so in
another message in this thread, quoting verbatim:

"yes. In my previous testing I was poking into the PCI registers of the
bridge controller and the GPU directly and that never caused any
issues as long as I limited it to putting the devices into D3hot."

You cannot claim a hardware bug just because a write to an
undocumented register from AML causes things to break.

First, that may be a bug in the AML (which is not unheard of).
Second, and that is more likely, the expectations of the AML code may
not be met at the time it is run.

Assuming the latter, the root cause is really that the kernel executes
the AML in a hardware configuration in which the expectations of that
AML are not met.

We are now trying to understand what those expectations may be and so
how to cause them to be met.

Your observation that the issue can be avoided if the GPU is not put
into D3hot by a PMCSR write is a step in that direction and it is a
good finding.  The information from Mika based on the ASL analysis is
helpful too.  Let's not jump to premature conclusions too quickly,
though.

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-22 Thread Rafael J. Wysocki
On Fri, Nov 22, 2019 at 11:36 AM Mika Westerberg
 wrote:
>
> On Thu, Nov 21, 2019 at 11:39:23PM +0100, Rafael J. Wysocki wrote:
> > On Thu, Nov 21, 2019 at 8:49 PM Mika Westerberg
> >  wrote:
> > >
> > > On Thu, Nov 21, 2019 at 04:43:24PM +0100, Rafael J. Wysocki wrote:
> > > > On Thu, Nov 21, 2019 at 1:52 PM Mika Westerberg
> > > >  wrote:
> > > > >
> > > > > On Thu, Nov 21, 2019 at 01:46:14PM +0200, Mika Westerberg wrote:
> > > > > > On Thu, Nov 21, 2019 at 12:34:22PM +0100, Rafael J. Wysocki wrote:
> > > > > > > On Thu, Nov 21, 2019 at 12:28 PM Mika Westerberg
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Wed, Nov 20, 2019 at 11:29:33PM +0100, Rafael J. Wysocki 
> > > > > > > > wrote:
> > > > > > > > > > last week or so I found systems where the GPU was under the 
> > > > > > > > > > "PCI
> > > > > > > > > > Express Root Port" (name from lspci) and on those systems 
> > > > > > > > > > all of that
> > > > > > > > > > seems to work. So I am wondering if it's indeed just the 
> > > > > > > > > > 0x1901 one,
> > > > > > > > > > which also explains Mikas case that Thunderbolt stuff works 
> > > > > > > > > > as devices
> > > > > > > > > > never get populated under this particular bridge 
> > > > > > > > > > controller, but under
> > > > > > > > > > those "Root Port"s
> > > > > > > > >
> > > > > > > > > It always is a PCIe port, but its location within the SoC may 
> > > > > > > > > matter.
> > > > > > > >
> > > > > > > > Exactly. Intel hardware has PCIe ports on CPU side (these are 
> > > > > > > > called
> > > > > > > > PEG, PCI Express Graphics, ports), and the PCH side. I think 
> > > > > > > > the IP is
> > > > > > > > still the same.
> > > > > > > >
> > > > > > > > > Also some custom AML-based power management is involved and 
> > > > > > > > > that may
> > > > > > > > > be making specific assumptions on the configuration of the 
> > > > > > > > > SoC and the
> > > > > > > > > GPU at the time of its invocation which unfortunately are not 
> > > > > > > > > known to
> > > > > > > > > us.
> > > > > > > > >
> > > > > > > > > However, it looks like the AML invoked to power down the GPU 
> > > > > > > > > from
> > > > > > > > > acpi_pci_set_power_state() gets confused if it is not in PCI 
> > > > > > > > > D0 at
> > > > > > > > > that point, so it looks like that AML tries to access device 
> > > > > > > > > memory on
> > > > > > > > > the GPU (beyond the PCI config space) or similar which is not
> > > > > > > > > accessible in PCI power states below D0.
> > > > > > > >
> > > > > > > > Or the PCI config space of the GPU when the parent root port is 
> > > > > > > > in D3hot
> > > > > > > > (as it is the case here). Also then the GPU config space is not
> > > > > > > > accessible.
> > > > > > >
> > > > > > > Why would the parent port be in D3hot at that point?  Wouldn't 
> > > > > > > that be
> > > > > > > a suspend ordering violation?
> > > > > >
> > > > > > No. We put the GPU into D3hot first,
> > > >
> > > > OK
> > > >
> > > > Does this involve any AML, like a _PS3 under the GPU object?
> > >
> > > I don't see _PS3 (nor _PS0) for that object. If I read it right the GPU
> > > itself is not described in ACPI tables at all.
> >
> > OK
> >
> > > > > > then the root port and then turn
> > > > > > off the power resource (which is attached to the root port) 
> > > > > > resulting
>

Re: [Nouveau] [PATCH v5] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-22 Thread Rafael J. Wysocki
On Fri, Nov 22, 2019 at 1:22 AM Karol Herbst  wrote:
>
> Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> states.
>
> v2: convert to pci_dev quirk
> put a proper technical explanation of the issue as a in-code comment
> v3: disable it only for certain combinations of intel and nvidia hardware
> v4: simplify quirk by setting flag on the GPU itself
> v5: restructure quirk to make it easier to add new IDs
> fix whitespace issues
> fix potential NULL pointer access
> update the quirk documentation
>
> Signed-off-by: Karol Herbst 
> Cc: Bjorn Helgaas 
> Cc: Lyude Paul 
> Cc: Rafael J. Wysocki 
> Cc: Mika Westerberg 
> Cc: linux-...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: dri-de...@lists.freedesktop.org
> Cc: nouveau@lists.freedesktop.org
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205623
> ---
>  drivers/pci/pci.c|  7 ++
>  drivers/pci/quirks.c | 51 
>  include/linux/pci.h  |  1 +
>  3 files changed, 59 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 57f15a7e6f0b..e08db2daa924 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -850,6 +850,13 @@ static int pci_raw_set_power_state(struct pci_dev *dev, 
> pci_power_t state)
>|| (state == PCI_D2 && !dev->d2_support))
> return -EIO;
>
> +   /*
> +* Check if we have a bad combination of bridge controller and nvidia
> +* GPU, see quirk_broken_nv_runpm for more info
> +*/
> +   if (state != PCI_D0 && dev->broken_nv_runpm)
> +   return 0;

The result of this change in the suspend-to-idle path will be leaving
the device and its PCIe port in D0 while suspended, unless the device
itself has power management methods in the ACPI tables (according to
Mika that is not the case).

I don't think that this is desirable.

> +
> pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
>
> /*
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44c4ae1abd00..24e3f247d291 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5268,3 +5268,54 @@ static void 
> quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
>   PCI_CLASS_DISPLAY_VGA, 8,
>   quirk_reset_lenovo_thinkpad_p50_nvgpu);
> +
> +/*
> + * Some Intel PCIe bridge controllers cause devices to not reappear doing a
> + * D0 -> D3hot -> D3cold -> D0 sequence.

This is inaccurate and not entirely fair AFAICS.

First off, what is a "PCIe bridge controller"?  A PCIe root complex?

Second, I don't think that you really can blame hardware here, because
the problem is related to AML (forcing a different code path in AML
makes it go away, so the same hardware with different AML would work).

More precisely, the behavior of the kernel is not what is expected by
AML associated with the PCIe port holding the device.

> Skipping the intermediate D3hot step
> + * seems to make it work again.

Yes, but the change would need to cover both the PM-runtime and
suspend-to-idle code paths.

Also it may be driver-induced rather than quirk-based.

> + *
> + * This leads to various manifestations of this issue:
> + *  - AIML code execution hits an infinite loop (as the coe waits on device

Typo: coe -> code

> + *memory to change).

Which AML code is this, the power-off part or power-on part?  Is this
AML code associated with the GPU or with the PCIe port holding it (I
guess the latter from what Mika said)?

Also IIRC ACPICA has a mechanism to break infinite loops in AML by
aborting the looping method after a timeout.

> + *  - kernel crashes, as all PCI reads return -1, which most code isn't able
> + *to handle well enough.
> + *  - sudden shutdowns, as the kernel identified an unrecoverable error after
> + *userspace tries to access the GPU.

IMO it would be enough to say that the GPU is not accessible after an
attempt to remove power from it.

> + *
> + * In all cases dmesg will contain at least one line like this:
> + * 'nouveau 0000:01:00.0: Refused to change power state, currently in D3'
> + * followed by a lot of nouveau timeouts.
> + *
> + * ACPI code

Which ACPI code?

> writes bit 0x80 to the not documented PCI register 0x248 of the

0x248 relative to what?  A PCI bar (if so then which one) or the PCI
config space (and which part of it if so)?

> + * Intel PCIe bridge controller (0x1901) in order to power down the GPU.

This doesn't seem accurate.  It rather writes to this register to
change the state of the PCIe link between the GPU and the PC

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-22 Thread Rafael J. Wysocki
On Fri, Nov 22, 2019 at 1:13 AM Karol Herbst  wrote:
>
> so while trying to test with d3cold disabled, I noticed that I run
> into the exact same error.

Does this mean that you disabled d3cold on the GPU via sysfs (the
"d3cold_allowed" attribute was 0) and the original problem still
occurred in that configuration?

> And I verified that the
> \_SB.PCI0.PEG0.PG00._STA returns 1, which indicates it should still be
> turned on.

I don't really understand this comment, so can you explain it a bit to
me, please?

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-21 Thread Rafael J. Wysocki
On Thu, Nov 21, 2019 at 8:49 PM Mika Westerberg
 wrote:
>
> On Thu, Nov 21, 2019 at 04:43:24PM +0100, Rafael J. Wysocki wrote:
> > On Thu, Nov 21, 2019 at 1:52 PM Mika Westerberg
> >  wrote:
> > >
> > > On Thu, Nov 21, 2019 at 01:46:14PM +0200, Mika Westerberg wrote:
> > > > On Thu, Nov 21, 2019 at 12:34:22PM +0100, Rafael J. Wysocki wrote:
> > > > > On Thu, Nov 21, 2019 at 12:28 PM Mika Westerberg
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, Nov 20, 2019 at 11:29:33PM +0100, Rafael J. Wysocki wrote:
> > > > > > > > last week or so I found systems where the GPU was under the "PCI
> > > > > > > > Express Root Port" (name from lspci) and on those systems all 
> > > > > > > > of that
> > > > > > > > seems to work. So I am wondering if it's indeed just the 0x1901 
> > > > > > > > one,
> > > > > > > > which also explains Mikas case that Thunderbolt stuff works as 
> > > > > > > > devices
> > > > > > > > never get populated under this particular bridge controller, 
> > > > > > > > but under
> > > > > > > > those "Root Port"s
> > > > > > >
> > > > > > > It always is a PCIe port, but its location within the SoC may 
> > > > > > > matter.
> > > > > >
> > > > > > Exactly. Intel hardware has PCIe ports on CPU side (these are called
> > > > > > PEG, PCI Express Graphics, ports), and the PCH side. I think the IP 
> > > > > > is
> > > > > > still the same.
> > > > > >
> > > > > > > Also some custom AML-based power management is involved and that 
> > > > > > > may
> > > > > > > be making specific assumptions on the configuration of the SoC 
> > > > > > > and the
> > > > > > > GPU at the time of its invocation which unfortunately are not 
> > > > > > > known to
> > > > > > > us.
> > > > > > >
> > > > > > > However, it looks like the AML invoked to power down the GPU from
> > > > > > > acpi_pci_set_power_state() gets confused if it is not in PCI D0 at
> > > > > > > that point, so it looks like that AML tries to access device 
> > > > > > > memory on
> > > > > > > the GPU (beyond the PCI config space) or similar which is not
> > > > > > > accessible in PCI power states below D0.
> > > > > >
> > > > > > Or the PCI config space of the GPU when the parent root port is in 
> > > > > > D3hot
> > > > > > (as it is the case here). Also then the GPU config space is not
> > > > > > accessible.
> > > > >
> > > > > Why would the parent port be in D3hot at that point?  Wouldn't that be
> > > > > a suspend ordering violation?
> > > >
> > > > No. We put the GPU into D3hot first,
> >
> > OK
> >
> > Does this involve any AML, like a _PS3 under the GPU object?
>
> I don't see _PS3 (nor _PS0) for that object. If I read it right the GPU
> itself is not described in ACPI tables at all.

OK

> > > > then the root port and then turn
> > > > off the power resource (which is attached to the root port) resulting
> > > > the topology entering D3cold.
> > >
> > > I don't see that happening in the AML though.
> >
> > Which AML do you mean, specifically?  The _OFF method for the root
> > port's _PR3 power resource or something else?
>
> The root port's _OFF method for the power resource returned by its _PR3.

OK, so without the $subject patch we (1) program the downstream
component (GPU) into D3hot, then we (2) program the port holding it
into D3hot and then we (3) let the AML (_OFF for the power resource
listed by _PR3 under the port object) run.

Something strange happens at this point (and I guess that _OFF doesn't
even reach the point where it removes power from the port which is why
we see a lock-up).

We know that skipping (1) makes things work and we kind of suspect
that skipping (3) would make things work either, but what about doing
(1) and (3) without (2)?

> > > Basically the difference is that when Windows 7 or Linux (the _REV==5
> > > check) then we directly do link disable whereas in Windows 8+ we 

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-21 Thread Rafael J. Wysocki
On Thu, Nov 21, 2019 at 5:06 PM Karol Herbst  wrote:
>
> On Thu, Nov 21, 2019 at 4:47 PM Rafael J. Wysocki  wrote:
> >
> > On Thu, Nov 21, 2019 at 1:53 PM Karol Herbst  wrote:
> > >
> > > On Thu, Nov 21, 2019 at 12:46 PM Mika Westerberg
> > >  wrote:
> > > >
> > > > On Thu, Nov 21, 2019 at 12:34:22PM +0100, Rafael J. Wysocki wrote:
> > > > > On Thu, Nov 21, 2019 at 12:28 PM Mika Westerberg
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, Nov 20, 2019 at 11:29:33PM +0100, Rafael J. Wysocki wrote:
> > > > > > > > last week or so I found systems where the GPU was under the "PCI
> > > > > > > > Express Root Port" (name from lspci) and on those systems all 
> > > > > > > > of that
> > > > > > > > seems to work. So I am wondering if it's indeed just the 0x1901 
> > > > > > > > one,
> > > > > > > > which also explains Mikas case that Thunderbolt stuff works as 
> > > > > > > > devices
> > > > > > > > never get populated under this particular bridge controller, 
> > > > > > > > but under
> > > > > > > > those "Root Port"s
> > > > > > >
> > > > > > > It always is a PCIe port, but its location within the SoC may 
> > > > > > > matter.
> > > > > >
> > > > > > Exactly. Intel hardware has PCIe ports on CPU side (these are called
> > > > > > PEG, PCI Express Graphics, ports), and the PCH side. I think the IP 
> > > > > > is
> > > > > > still the same.
> > > > > >
> > >
> > > yeah, I meant the bridge controller with the ID 0x1901 is on the CPU
> > > side. And if the Nvidia GPU is on a port on the PCH side it all seems
> > > to work just fine.
> >
> > But that may involve different AML too, may it not?
> >
> > > > > > > Also some custom AML-based power management is involved and that 
> > > > > > > may
> > > > > > > be making specific assumptions on the configuration of the SoC 
> > > > > > > and the
> > > > > > > GPU at the time of its invocation which unfortunately are not 
> > > > > > > known to
> > > > > > > us.
> > > > > > >
> > > > > > > However, it looks like the AML invoked to power down the GPU from
> > > > > > > acpi_pci_set_power_state() gets confused if it is not in PCI D0 at
> > > > > > > that point, so it looks like that AML tries to access device 
> > > > > > > memory on
> > > > > > > the GPU (beyond the PCI config space) or similar which is not
> > > > > > > accessible in PCI power states below D0.
> > > > > >
> > > > > > Or the PCI config space of the GPU when the parent root port is in 
> > > > > > D3hot
> > > > > > (as it is the case here). Also then the GPU config space is not
> > > > > > accessible.
> > > > >
> > > > > Why would the parent port be in D3hot at that point?  Wouldn't that be
> > > > > a suspend ordering violation?
> > > >
> > > > No. We put the GPU into D3hot first, then the root port and then turn
> > > > off the power resource (which is attached to the root port) resulting
> > > > the topology entering D3cold.
> > > >
> > >
> > > If the kernel does a D0 -> D3hot -> D0 cycle this works as well, but
> > > the power savings are way lower, so I kind of prefer skipping D3hot
> > > instead of D3cold. Skipping D3hot doesn't seem to make any difference
> > > in power savings in my testing.
> >
> > OK
> >
> > What exactly did you do to skip D3cold in your testing?
> >
>
> For that I poked into the PCI registers directly and skipped doing the
> ACPI calls and simply checked for the idle power consumption on my
> laptop.

That doesn't involve the PCIe port PM, however.

> But I guess I should retest with calling pci_d3cold_disable
> from nouveau instead? Or is there a different preferable way of
> testing this?

There is a sysfs attribute called "d3cold_allowed" which can be used
for "blocking" D3cold, so can you please retest using that?

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-21 Thread Rafael J. Wysocki
On Thu, Nov 21, 2019 at 1:53 PM Karol Herbst  wrote:
>
> On Thu, Nov 21, 2019 at 12:46 PM Mika Westerberg
>  wrote:
> >
> > On Thu, Nov 21, 2019 at 12:34:22PM +0100, Rafael J. Wysocki wrote:
> > > On Thu, Nov 21, 2019 at 12:28 PM Mika Westerberg
> > >  wrote:
> > > >
> > > > On Wed, Nov 20, 2019 at 11:29:33PM +0100, Rafael J. Wysocki wrote:
> > > > > > last week or so I found systems where the GPU was under the "PCI
> > > > > > Express Root Port" (name from lspci) and on those systems all of 
> > > > > > that
> > > > > > seems to work. So I am wondering if it's indeed just the 0x1901 one,
> > > > > > which also explains Mikas case that Thunderbolt stuff works as 
> > > > > > devices
> > > > > > never get populated under this particular bridge controller, but 
> > > > > > under
> > > > > > those "Root Port"s
> > > > >
> > > > > It always is a PCIe port, but its location within the SoC may matter.
> > > >
> > > > Exactly. Intel hardware has PCIe ports on CPU side (these are called
> > > > PEG, PCI Express Graphics, ports), and the PCH side. I think the IP is
> > > > still the same.
> > > >
>
> yeah, I meant the bridge controller with the ID 0x1901 is on the CPU
> side. And if the Nvidia GPU is on a port on the PCH side it all seems
> to work just fine.

But that may involve different AML too, may it not?

> > > > > Also some custom AML-based power management is involved and that may
> > > > > be making specific assumptions on the configuration of the SoC and the
> > > > > GPU at the time of its invocation which unfortunately are not known to
> > > > > us.
> > > > >
> > > > > However, it looks like the AML invoked to power down the GPU from
> > > > > acpi_pci_set_power_state() gets confused if it is not in PCI D0 at
> > > > > that point, so it looks like that AML tries to access device memory on
> > > > > the GPU (beyond the PCI config space) or similar which is not
> > > > > accessible in PCI power states below D0.
> > > >
> > > > Or the PCI config space of the GPU when the parent root port is in D3hot
> > > > (as it is the case here). Also then the GPU config space is not
> > > > accessible.
> > >
> > > Why would the parent port be in D3hot at that point?  Wouldn't that be
> > > a suspend ordering violation?
> >
> > No. We put the GPU into D3hot first, then the root port and then turn
> > off the power resource (which is attached to the root port) resulting
> > the topology entering D3cold.
> >
>
> If the kernel does a D0 -> D3hot -> D0 cycle this works as well, but
> the power savings are way lower, so I kind of prefer skipping D3hot
> instead of D3cold. Skipping D3hot doesn't seem to make any difference
> in power savings in my testing.

OK

What exactly did you do to skip D3cold in your testing?

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-21 Thread Rafael J. Wysocki
On Thu, Nov 21, 2019 at 1:52 PM Mika Westerberg
 wrote:
>
> On Thu, Nov 21, 2019 at 01:46:14PM +0200, Mika Westerberg wrote:
> > On Thu, Nov 21, 2019 at 12:34:22PM +0100, Rafael J. Wysocki wrote:
> > > On Thu, Nov 21, 2019 at 12:28 PM Mika Westerberg
> > >  wrote:
> > > >
> > > > On Wed, Nov 20, 2019 at 11:29:33PM +0100, Rafael J. Wysocki wrote:
> > > > > > last week or so I found systems where the GPU was under the "PCI
> > > > > > Express Root Port" (name from lspci) and on those systems all of 
> > > > > > that
> > > > > > seems to work. So I am wondering if it's indeed just the 0x1901 one,
> > > > > > which also explains Mikas case that Thunderbolt stuff works as 
> > > > > > devices
> > > > > > never get populated under this particular bridge controller, but 
> > > > > > under
> > > > > > those "Root Port"s
> > > > >
> > > > > It always is a PCIe port, but its location within the SoC may matter.
> > > >
> > > > Exactly. Intel hardware has PCIe ports on CPU side (these are called
> > > > PEG, PCI Express Graphics, ports), and the PCH side. I think the IP is
> > > > still the same.
> > > >
> > > > > Also some custom AML-based power management is involved and that may
> > > > > be making specific assumptions on the configuration of the SoC and the
> > > > > GPU at the time of its invocation which unfortunately are not known to
> > > > > us.
> > > > >
> > > > > However, it looks like the AML invoked to power down the GPU from
> > > > > acpi_pci_set_power_state() gets confused if it is not in PCI D0 at
> > > > > that point, so it looks like that AML tries to access device memory on
> > > > > the GPU (beyond the PCI config space) or similar which is not
> > > > > accessible in PCI power states below D0.
> > > >
> > > > Or the PCI config space of the GPU when the parent root port is in D3hot
> > > > (as it is the case here). Also then the GPU config space is not
> > > > accessible.
> > >
> > > Why would the parent port be in D3hot at that point?  Wouldn't that be
> > > a suspend ordering violation?
> >
> > No. We put the GPU into D3hot first,

OK

Does this involve any AML, like a _PS3 under the GPU object?

> > then the root port and then turn
> > off the power resource (which is attached to the root port) resulting
> > the topology entering D3cold.
>
> I don't see that happening in the AML though.

Which AML do you mean, specifically?  The _OFF method for the root
port's _PR3 power resource or something else?

> Basically the difference is that when Windows 7 or Linux (the _REV==5
> check) then we directly do link disable whereas in Windows 8+ we invoke
> LKDS() method that puts the link into L2/L3. None of the fields they
> access seem to touch the GPU itself.

So that may be where the problem is.

Putting the downstream component into PCI D[1-3] is expected to put
the link into L1, so I'm not sure how that plays with the later
attempt to put it into L2/L3 Ready.

Also, L2/L3 Ready is expected to be transient, so finally power should
be removed somehow.

> LKDS() for the first PEG port looks like this:
>
>P0L2 = One
>Sleep (0x10)
>Local0 = Zero
>While (P0L2)
>{
> If ((Local0 > 0x04))
> {
> Break
> }
>
> Sleep (0x10)
> Local0++
>}
>
> One thing that comes to mind is that the loop can end even if P0L2 is
> not cleared as it does only 5 iterations with 16 ms sleep between. Maybe
> Sleep() is implemented differently in Windows? I mean Linux may be
> "faster" here and return prematurely and if we leave the port into D0
> this does not happen, or something. I'm just throwing out ideas :)

But this actually works for the downstream component in D0, doesn't it?

Also, if the downstream component is in D0, the port actually should
stay in D0 too, so what would happen with the $subject patch applied?

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-21 Thread Rafael J. Wysocki
On Thu, Nov 21, 2019 at 12:28 PM Mika Westerberg
 wrote:
>
> On Wed, Nov 20, 2019 at 11:29:33PM +0100, Rafael J. Wysocki wrote:
> > > last week or so I found systems where the GPU was under the "PCI
> > > Express Root Port" (name from lspci) and on those systems all of that
> > > seems to work. So I am wondering if it's indeed just the 0x1901 one,
> > > which also explains Mikas case that Thunderbolt stuff works as devices
> > > never get populated under this particular bridge controller, but under
> > > those "Root Port"s
> >
> > It always is a PCIe port, but its location within the SoC may matter.
>
> Exactly. Intel hardware has PCIe ports on CPU side (these are called
> PEG, PCI Express Graphics, ports), and the PCH side. I think the IP is
> still the same.
>
> > Also some custom AML-based power management is involved and that may
> > be making specific assumptions on the configuration of the SoC and the
> > GPU at the time of its invocation which unfortunately are not known to
> > us.
> >
> > However, it looks like the AML invoked to power down the GPU from
> > acpi_pci_set_power_state() gets confused if it is not in PCI D0 at
> > that point, so it looks like that AML tries to access device memory on
> > the GPU (beyond the PCI config space) or similar which is not
> > accessible in PCI power states below D0.
>
> Or the PCI config space of the GPU when the parent root port is in D3hot
> (as it is the case here). Also then the GPU config space is not
> accessible.

Why would the parent port be in D3hot at that point?  Wouldn't that be
a suspend ordering violation?

> I took a look at the HP Omen ACPI tables which has similar problem and
> there is also a check for Windows 7 (but not Linux) so I think one
> alternative workaround would be to add these devices into
> acpi_osi_dmi_table[] where .callback is set to dmi_disable_osi_win8 (or
> pass 'acpi_osi="!Windows 2012"' in the kernel command line).

I'd like to understand the facts that have been established so far
before deciding what to do about them. :-)

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-21 Thread Rafael J. Wysocki
On Thu, Nov 21, 2019 at 12:17 PM Mika Westerberg
 wrote:
>
> On Thu, Nov 21, 2019 at 12:03:52PM +0100, Rafael J. Wysocki wrote:
> > On Thu, Nov 21, 2019 at 11:14 AM Mika Westerberg
> >  wrote:
> > >
> > > On Wed, Nov 20, 2019 at 10:36:31PM +0100, Karol Herbst wrote:
> > > > with the branch and patch applied:
> > > > https://gist.githubusercontent.com/karolherbst/03c4c8141b0fa292d781badfa186479e/raw/5c62640afbc57d6e69ea924c338bd2836e770d02/gistfile1.txt
> > >
> > > Thanks for testing. Too bad it did not help :( I suppose there is no
> > > change if you increase the delay to say 1s?
> >
> > Well, look at the original patch in this thread.
> >
> > What it does is to prevent the device (GPU in this particular case)
> > from going into a PCI low-power state before invoking AML to power it
> > down (the AML is still invoked after this patch AFAICS), so why would
> > that have anything to do with the delays?
>
> Yes, I know what it does :) I was just thinking that maybe it's still
> the link that does not come up when we go back to D0 I guess that's not
> the case here.

I'm not sure why that would be related to putting the device into,
say, PCI D3 before invoking AML to remove power from it.  If it is not
in PCI D3 at this point, the AML still runs and still removes power
from it IIUC, so on the way back the situation is the same regardless:
the device has no power which (again) needs to be restored by AML.
That (in principle) should not depend on what happened to the device
before it lost power.

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-21 Thread Rafael J. Wysocki
On Thu, Nov 21, 2019 at 12:08 PM Rafael J. Wysocki  wrote:
>
> On Thu, Nov 21, 2019 at 12:03 PM Rafael J. Wysocki  wrote:
> >
> > On Thu, Nov 21, 2019 at 11:14 AM Mika Westerberg
> >  wrote:
> > >
> > > On Wed, Nov 20, 2019 at 10:36:31PM +0100, Karol Herbst wrote:
> > > > with the branch and patch applied:
> > > > https://gist.githubusercontent.com/karolherbst/03c4c8141b0fa292d781badfa186479e/raw/5c62640afbc57d6e69ea924c338bd2836e770d02/gistfile1.txt
> > >
> > > Thanks for testing. Too bad it did not help :( I suppose there is no
> > > change if you increase the delay to say 1s?
> >
> > Well, look at the original patch in this thread.
> >
> > What it does is to prevent the device (GPU in this particular case)
> > from going into a PCI low-power state before invoking AML to power it
> > down (the AML is still invoked after this patch AFAICS), so why would
> > that have anything to do with the delays?
> >
> > The only reason would be the AML running too early, but that doesn't
> > seem likely.  IMO more likely is that the AML does something which
> > cannot be done to a device in a PCI low-power state.
>
> BTW, I'm wondering if anyone has tried to skip the AML instead of
> skipping the PCI PM in this case (as of 5.4-rc that would be a similar
> patch to skip the invocations of
> __pci_start/complete_power_transition() in pci_set_power_state() for
> the affected device).

Moving the dev->broken_nv_runpm test into
pci_platform_power_transition() (also for transitions into D0) would
be sufficient for that test if I'm not mistaken.
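
A minimal sketch of that idea, loosely based on the 5.4-era pci.c and on the
broken_nv_runpm flag from the patch under discussion (this is not the code
that was eventually merged, and the flag placement here is an assumption):

static int pci_platform_power_transition(struct pci_dev *dev, pci_power_t state)
{
        int error;

        /*
         * Skip the platform (ACPI/AML) part of the transition entirely for
         * a flagged device, while pci_set_power_state() still programs the
         * PCI PM registers as usual.
         */
        if (dev->broken_nv_runpm)
                return 0;

        if (platform_pci_power_manageable(dev)) {
                error = platform_pci_set_power_state(dev, state);
                if (!error)
                        pci_update_current_state(dev, state);
        } else {
                error = -ENODEV;
        }

        if (error && !dev->pm_cap)
                dev->current_state = PCI_D0;    /* fall back to D0 */

        return error;
}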

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-21 Thread Rafael J. Wysocki
On Thu, Nov 21, 2019 at 12:03 PM Rafael J. Wysocki  wrote:
>
> On Thu, Nov 21, 2019 at 11:14 AM Mika Westerberg
>  wrote:
> >
> > On Wed, Nov 20, 2019 at 10:36:31PM +0100, Karol Herbst wrote:
> > > with the branch and patch applied:
> > > https://gist.githubusercontent.com/karolherbst/03c4c8141b0fa292d781badfa186479e/raw/5c62640afbc57d6e69ea924c338bd2836e770d02/gistfile1.txt
> >
> > Thanks for testing. Too bad it did not help :( I suppose there is no
> > change if you increase the delay to say 1s?
>
> Well, look at the original patch in this thread.
>
> What it does is to prevent the device (GPU in this particular case)
> from going into a PCI low-power state before invoking AML to power it
> down (the AML is still invoked after this patch AFAICS), so why would
> that have anything to do with the delays?
>
> The only reason would be the AML running too early, but that doesn't
> seem likely.  IMO more likely is that the AML does something which
> cannot be done to a device in a PCI low-power state.

BTW, I'm wondering if anyone has tried to skip the AML instead of
skipping the PCI PM in this case (as of 5.4-rc that would be a similar
patch to skip the invocations of
__pci_start/complete_power_transition() in pci_set_power_state() for
the affected device).

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-21 Thread Rafael J. Wysocki
On Thu, Nov 21, 2019 at 11:14 AM Mika Westerberg
 wrote:
>
> On Wed, Nov 20, 2019 at 10:36:31PM +0100, Karol Herbst wrote:
> > with the branch and patch applied:
> > https://gist.githubusercontent.com/karolherbst/03c4c8141b0fa292d781badfa186479e/raw/5c62640afbc57d6e69ea924c338bd2836e770d02/gistfile1.txt
>
> Thanks for testing. Too bad it did not help :( I suppose there is no
> change if you increase the delay to say 1s?

Well, look at the original patch in this thread.

What it does is to prevent the device (GPU in this particular case)
from going into a PCI low-power state before invoking AML to power it
down (the AML is still invoked after this patch AFAICS), so why would
that have anything to do with the delays?

The only reason would be the AML running too early, but that doesn't
seem likely.  IMO more likely is that the AML does something which
cannot be done to a device in a PCI low-power state.

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-20 Thread Rafael J. Wysocki
On Wed, Nov 20, 2019 at 10:40 PM Karol Herbst  wrote:
>
> On Wed, Nov 20, 2019 at 10:37 PM Rafael J. Wysocki  wrote:
> >
> > On Wed, Nov 20, 2019 at 4:53 PM Mika Westerberg
> >  wrote:
> > >
> > > On Wed, Nov 20, 2019 at 04:37:14PM +0100, Karol Herbst wrote:
> > > > On Wed, Nov 20, 2019 at 4:15 PM Mika Westerberg
> > > >  wrote:
> > > > >
> > > > > On Wed, Nov 20, 2019 at 01:11:52PM +0100, Karol Herbst wrote:
> > > > > > On Wed, Nov 20, 2019 at 1:09 PM Mika Westerberg
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Wed, Nov 20, 2019 at 12:58:00PM +0100, Karol Herbst wrote:
> > > > > > > > overall, what I really want to know is, _why_ does it work on 
> > > > > > > > windows?
> > > > > > >
> > > > > > > So do I ;-)
> > > > > > >
> > > > > > > > Or what are we doing differently on Linux so that it doesn't 
> > > > > > > > work? If
> > > > > > > > anybody has any idea on how we could dig into this and figure 
> > > > > > > > it out
> > > > > > > > on this level, this would probably allow us to get closer to 
> > > > > > > > the root
> > > > > > > > cause? no?
> > > > > > >
> > > > > > > Have you tried to use the acpi_rev_override parameter in your 
> > > > > > > system and
> > > > > > > does it have any effect?
> > > > > > >
> > > > > > > Also did you try to trace the ACPI _ON/_OFF() methods? I think 
> > > > > > > that
> > > > > > > should hopefully reveal something.
> > > > > > >
> > > > > >
> > > > > > I think I did in the past and it seemed to have worked, there is 
> > > > > > just
> > > > > > one big issue with this: it's a Dell specific workaround afaik, and
> > > > > > this issue plagues not just Dell, but we've seen it on HP and Lenovo
> > > > > > laptops as well, and I've heard about users having the same issues 
> > > > > > on
> > > > > > Asus and MSI laptops as well.
> > > > >
> > > > > Maybe it is not a workaround at all but instead it simply determines
> > > > > whether the system supports RTD3 or something like that (IIRC Windows 
> > > > > 8
> > > > > started supporting it). Maybe Dell added check for Linux because at 
> > > > > that
> > > > > time Linux did not support it.
> > > > >
> > > >
> > > > the point is, it's not checking it by default, so by default you still
> > > > run into the windows 8 codepath.
> > >
> > > Well you can add the quirk to acpi_rev_dmi_table[] so it goes to that
> > > path by default. There are a bunch of similar entries for Dell machines.
> >
> > OK, so the "Linux path" works and the other doesn't.
> >
> > I thought that this was the other way around, sorry for the confusion.
> >
> > > Of course this does not help the non-Dell users so we would still need
> > > to figure out the root cause.
> >
> > Right.
> >
> > Whatever it is, though, AML appears to be involved in it and AFAICS
> > there's no evidence that it affects any root ports that are not
> > populated with NVidia GPUs.
> >
>
> last week or so I found systems where the GPU was under the "PCI
> Express Root Port" (name from lspci) and on those systems all of that
> seems to work. So I am wondering if it's indeed just the 0x1901 one,
> which also explains Mika's case that Thunderbolt stuff works, as devices
> never get populated under this particular bridge controller, but under
> those "Root Port"s

It always is a PCIe port, but its location within the SoC may matter.

Also, some custom AML-based power management is involved, and it may be
making specific assumptions about the configuration of the SoC and the
GPU at the time of its invocation which, unfortunately, are not known to
us.

However, it looks like the AML invoked to power down the GPU from
acpi_pci_set_power_state() gets confused if it is not in PCI D0 at
that point, so it looks like that AML tries to access device memory on
the GPU (beyond the PCI config space) or similar which is not
accessible in PCI power states below D0.

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-20 Thread Rafael J. Wysocki
On Wed, Nov 20, 2019 at 4:53 PM Mika Westerberg
 wrote:
>
> On Wed, Nov 20, 2019 at 04:37:14PM +0100, Karol Herbst wrote:
> > On Wed, Nov 20, 2019 at 4:15 PM Mika Westerberg
> >  wrote:
> > >
> > > On Wed, Nov 20, 2019 at 01:11:52PM +0100, Karol Herbst wrote:
> > > > On Wed, Nov 20, 2019 at 1:09 PM Mika Westerberg
> > > >  wrote:
> > > > >
> > > > > On Wed, Nov 20, 2019 at 12:58:00PM +0100, Karol Herbst wrote:
> > > > > > overall, what I really want to know is, _why_ does it work on 
> > > > > > windows?
> > > > >
> > > > > So do I ;-)
> > > > >
> > > > > > Or what are we doing differently on Linux so that it doesn't work? 
> > > > > > If
> > > > > > anybody has any idea on how we could dig into this and figure it out
> > > > > > on this level, this would probably allow us to get closer to the 
> > > > > > root
> > > > > > cause? no?
> > > > >
> > > > > Have you tried to use the acpi_rev_override parameter in your system 
> > > > > and
> > > > > does it have any effect?
> > > > >
> > > > > Also did you try to trace the ACPI _ON/_OFF() methods? I think that
> > > > > should hopefully reveal something.
> > > > >
> > > >
> > > > I think I did in the past and it seemed to have worked, there is just
> > > > one big issue with this: it's a Dell specific workaround afaik, and
> > > > this issue plagues not just Dell, but we've seen it on HP and Lenovo
> > > > laptops as well, and I've heard about users having the same issues on
> > > > Asus and MSI laptops as well.
> > >
> > > Maybe it is not a workaround at all but instead it simply determines
> > > whether the system supports RTD3 or something like that (IIRC Windows 8
> > > started supporting it). Maybe Dell added check for Linux because at that
> > > time Linux did not support it.
> > >
> >
> > the point is, it's not checking it by default, so by default you still
> > run into the windows 8 codepath.
>
> Well you can add the quirk to acpi_rev_dmi_table[] so it goes to that
> path by default. There are a bunch of similar entries for Dell machines.

OK, so the "Linux path" works and the other doesn't.

I thought that this was the other way around, sorry for the confusion.

> Of course this does not help the non-Dell users so we would still need
> to figure out the root cause.

Right.

Whatever it is, though, AML appears to be involved in it and AFAICS
there's no evidence that it affects any root ports that are not
populated with NVidia GPUs.

Now, one thing is still not clear to me from the discussion so far: is
the _PR3 method you mentioned defined under the GPU device object or
under the port device object?
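
For reference, the acpi_rev_dmi_table[] entry Mika refers to above lives in
drivers/acpi/blacklist.c (under CONFIG_ACPI_REV_OVERRIDE_POSSIBLE) and would
look roughly like this; it has the same effect as booting with
acpi_rev_override, i.e. _REV returns 5 and the AML takes its "Linux" path.
The product string below is a placeholder:

{
        .callback = dmi_enable_rev_override,
        .ident = "Dell XPS 15 9560 (placeholder)",
        .matches = {
                DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
                DMI_MATCH(DMI_PRODUCT_NAME, "XPS 15 9560"),
        },
},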

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-20 Thread Rafael J. Wysocki
On Wed, Nov 20, 2019 at 1:10 PM Karol Herbst  wrote:
>
> On Wed, Nov 20, 2019 at 1:06 PM Rafael J. Wysocki  wrote:
> >
> > On Wed, Nov 20, 2019 at 12:51 PM Karol Herbst  wrote:
> > >
> > > On Wed, Nov 20, 2019 at 12:48 PM Rafael J. Wysocki  
> > > wrote:
> > > >
> > > > On Wed, Nov 20, 2019 at 12:22 PM Mika Westerberg
> > > >  wrote:
> > > > >
> > > > > On Wed, Nov 20, 2019 at 11:52:22AM +0100, Rafael J. Wysocki wrote:
> > > > > > On Wed, Nov 20, 2019 at 11:18 AM Mika Westerberg
> > > > > >  wrote:
> > > > > > >
> >
> > [cut]
> >
> > > > > >
> > > > > > Oh, so does it look like we are trying to work around AML that tried
> > > > > > to work around some problematic behavior in Linux at one point?
> > > > >
> > > > > Yes, it looks like so if I read the ASL right.
> > > >
> > > > OK, so that would call for a DMI-based quirk as the real cause for the
> > > > issue seems to be the AML in question, which means a firmware problem.
> > > >
> > >
> > > And I disagree as this is a linux specific workaround and windows goes
> > > that path and succeeds. This firmware based workaround was added,
> > > because it broke on Linux.
> >
> > Apparently so at the time it was added, but would it still break after
> > the kernel changes made since then?
> >
> > Moreover, has it not become harmful now?  IOW, wouldn't it work after
> > removing the "Linux workaround" from the AML?
> >
> > The only way to verify that I can see would be to run the system with
> > custom ACPI tables without the "Linux workaround" in the AML in
> > question.
> >
>
> the workaround is not enabled by default, because it has to be
> explicitly enabled by the user.

I'm not sure what you are talking about.

I'm talking specifically about the ((OSYS == 0x07DF) && (_REV == 0x05))
check mentioned by Mika which doesn't seem to depend on user input in
any way.

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-20 Thread Rafael J. Wysocki
On Wed, Nov 20, 2019 at 1:06 PM Rafael J. Wysocki  wrote:
>
> On Wed, Nov 20, 2019 at 12:51 PM Karol Herbst  wrote:
> >
> > On Wed, Nov 20, 2019 at 12:48 PM Rafael J. Wysocki  
> > wrote:
> > >
> > > On Wed, Nov 20, 2019 at 12:22 PM Mika Westerberg
> > >  wrote:
> > > >
> > > > On Wed, Nov 20, 2019 at 11:52:22AM +0100, Rafael J. Wysocki wrote:
> > > > > On Wed, Nov 20, 2019 at 11:18 AM Mika Westerberg
> > > > >  wrote:
> > > > > >
>
> [cut]
>
> > > > >
> > > > > Oh, so does it look like we are trying to work around AML that tried
> > > > > to work around some problematic behavior in Linux at one point?
> > > >
> > > > Yes, it looks like so if I read the ASL right.
> > >
> > > OK, so that would call for a DMI-based quirk as the real cause for the
> > > issue seems to be the AML in question, which means a firmware problem.
> > >
> >
> > And I disagree as this is a linux specific workaround and windows goes
> > that path and succeeds. This firmware based workaround was added,
> > because it broke on Linux.
>
> Apparently so at the time it was added, but would it still break after
> the kernel changes made since then?
>
> Moreover, has it not become harmful now?  IOW, wouldn't it work after
> removing the "Linux workaround" from the AML?
>
> The only way to verify that I can see would be to run the system with
> custom ACPI tables without the "Linux workaround" in the AML in
> question.

Or running it with acpi_rev_override as suggested by Mika, which
effectively would be the same thing.

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-20 Thread Rafael J. Wysocki
On Wed, Nov 20, 2019 at 12:51 PM Karol Herbst  wrote:
>
> On Wed, Nov 20, 2019 at 12:48 PM Rafael J. Wysocki  wrote:
> >
> > On Wed, Nov 20, 2019 at 12:22 PM Mika Westerberg
> >  wrote:
> > >
> > > On Wed, Nov 20, 2019 at 11:52:22AM +0100, Rafael J. Wysocki wrote:
> > > > On Wed, Nov 20, 2019 at 11:18 AM Mika Westerberg
> > > >  wrote:
> > > > >

[cut]

> > > >
> > > > Oh, so does it look like we are trying to work around AML that tried
> > > > to work around some problematic behavior in Linux at one point?
> > >
> > > Yes, it looks like so if I read the ASL right.
> >
> > OK, so that would call for a DMI-based quirk as the real cause for the
> > issue seems to be the AML in question, which means a firmware problem.
> >
>
> And I disagree as this is a linux specific workaround and windows goes
> that path and succeeds. This firmware based workaround was added,
> because it broke on Linux.

Apparently so at the time it was added, but would it still break after
the kernel changes made since then?

Moreover, has it not become harmful now?  IOW, wouldn't it work after
removing the "Linux workaround" from the AML?

The only way to verify that I can see would be to run the system with
custom ACPI tables without the "Linux workaround" in the AML in
question.

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-20 Thread Rafael J. Wysocki
On Wed, Nov 20, 2019 at 12:22 PM Mika Westerberg
 wrote:
>
> On Wed, Nov 20, 2019 at 11:52:22AM +0100, Rafael J. Wysocki wrote:
> > On Wed, Nov 20, 2019 at 11:18 AM Mika Westerberg
> >  wrote:
> > >
> > > Hi Karol,
> > >
> > > On Tue, Nov 19, 2019 at 11:26:45PM +0100, Karol Herbst wrote:
> > > > On Tue, Nov 19, 2019 at 10:50 PM Bjorn Helgaas  
> > > > wrote:
> > > > >
> > > > > [+cc Dave]
> > > > >
> > > > > On Thu, Oct 17, 2019 at 02:19:01PM +0200, Karol Herbst wrote:
> > > > > > Fixes state transitions of Nvidia Pascal GPUs from D3cold into 
> > > > > > higher device
> > > > > > states.
> > > > > >
> > > > > > v2: convert to pci_dev quirk
> > > > > > put a proper technical explanation of the issue as a in-code 
> > > > > > comment
> > > > > > v3: disable it only for certain combinations of intel and nvidia 
> > > > > > hardware
> > > > > > v4: simplify quirk by setting flag on the GPU itself
> > > > >
> > > > > I have zero confidence that we understand the real problem, but we do
> > > > > need to do something with this.  I'll merge it for v5.5 if we get the
> > > > > minor procedural stuff below straightened out.
> > > > >
> > > >
> > > > Thanks, and I agree with your statement, but at this point I think
> > > > only Intel can help out digging deeper as I see no way to debug this
> > > > further.
> > >
> > > I don't have anything against this patch, as long as the quirk stays
> > > limited to the particular root port leading to the NVIDIA GPU. The
> > > reason why I think it should be limited is that I'm pretty certain
> > > the problem is not in the root port itself. I have here a KBL based
> > > Thinkpad X1 Carbon 6th gen that can put the TBT controller into D3cold
> > > (it is connected to PCH root port) and it wakes up there just fine, so
> > > don't want to break that.
> > >
> > > Now, PCIe devices cannot go into D3cold all by themselves. They always
> > > need help from the platform side which is ACPI in this case. This is
> > > done by having the device to have _PR3 method that returns one or more
> > > power resources that the OS is supposed to turn off when the device is
> > > put into D3cold. All of that is implemented as form of ACPI methods that
> > > pretty much do the hardware specific things that are outside of PCIe
> > > spec to get the device into D3cold. At high level the _OFF() method
> > > causes the root port to broadcast a PME_Turn_Off message, which results
> > > in the link entering L2/L3 Ready; it then asserts PERST, configures WAKE (both
> > > can be GPIOs) and finally removes power (if the link goes into L3,
> > > otherwise it goes into L2).
> > >
> > > I think this is where the problem actually lies - the ASL methods that
> > > are used to put the device into D3cold and back. We know that in Windows
> > > this all works fine so unless Windows quirks the root port the same way
> > > there is another reason behind this.
> > >
> > > In case of Dell XPS 9560 (IIRC that's the machine you have) the
> > > corresponding power resource is called \_SB.PCI0.PEG0.PG00 and its
> > > _ON/_OFF methods end up calling PGON()/PGOF() accordingly. The methods
> > > itself do lots of things and it is hard to follow the disassembled
> > > ASL which does not have any comments but there are couple of things that
> > > stand out where we may go into a different path. One of them is this in
> > > the PGOF() method:
> > >
> > >If (((OSYS <= 0x07D9) || ((OSYS == 0x07DF) && (_REV == 0x05
> > >
> > > The ((OSYS == 0x07DF) && (_REV == 0x05)) checks specifically for Linux
> > > (see [1] and 18d78b64fddc ("ACPI / init: Make it possible to override
> > > _REV")) so it might be that Dell people tested this at some point in
> > > Linux as well. Added Mario in case he has any ideas.
> > >
> > > Previously I suggested you to try the ACPI method tracing to see what
> > > happens inside PGOF(). Did you have time to try it? It may provide more
> > > information about that is happening inside those methods and hopefully
> > > point us to the root cause.
> > >
> > > Also if you haven't tried already passing acpi_rev_override in the
> > > command line makes the _REV to return 5 so it should go into the "Linux"
> > > path in PGOF().
> >
> > Oh, so does it look like we are trying to work around AML that tried
> > to work around some problematic behavior in Linux at one point?
>
> Yes, it looks like so if I read the ASL right.

OK, so that would call for a DMI-based quirk as the real cause for the
issue seems to be the AML in question, which means a firmware problem.
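
For readers following the thread, the shape of the quirk under review ("v4:
simplify quirk by setting flag on the GPU itself") is roughly the following
reconstruction; it is illustrative, not the exact patch, and broken_nv_runpm
is a field the patch adds to struct pci_dev rather than an existing one:

#include <linux/pci.h>

/*
 * Tag NVIDIA GPUs sitting below the problematic Intel PEG root port
 * (device ID 0x1901, mentioned elsewhere in the thread) so PCI PM can
 * keep them out of low-power states.
 */
static void quirk_broken_nv_runpm(struct pci_dev *gpu)
{
        struct pci_dev *bridge = pci_upstream_bridge(gpu);

        if (bridge && bridge->vendor == PCI_VENDOR_ID_INTEL &&
            bridge->device == 0x1901)
                gpu->broken_nv_runpm = 1;
}
DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
                              PCI_BASE_CLASS_DISPLAY, 16,
                              quirk_broken_nv_runpm);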

Re: [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

2019-11-20 Thread Rafael J. Wysocki
On Wed, Nov 20, 2019 at 11:18 AM Mika Westerberg
 wrote:
>
> Hi Karol,
>
> On Tue, Nov 19, 2019 at 11:26:45PM +0100, Karol Herbst wrote:
> > On Tue, Nov 19, 2019 at 10:50 PM Bjorn Helgaas  wrote:
> > >
> > > [+cc Dave]
> > >
> > > On Thu, Oct 17, 2019 at 02:19:01PM +0200, Karol Herbst wrote:
> > > > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher 
> > > > device
> > > > states.
> > > >
> > > > v2: convert to pci_dev quirk
> > > > put a proper technical explanation of the issue as a in-code comment
> > > > v3: disable it only for certain combinations of intel and nvidia 
> > > > hardware
> > > > v4: simplify quirk by setting flag on the GPU itself
> > >
> > > I have zero confidence that we understand the real problem, but we do
> > > need to do something with this.  I'll merge it for v5.5 if we get the
> > > minor procedural stuff below straightened out.
> > >
> >
> > Thanks, and I agree with your statement, but at this point I think
> > only Intel can help out digging deeper as I see no way to debug this
> > further.
>
> I don't have anything against this patch, as long as the quirk stays
> limited to the particular root port leading to the NVIDIA GPU. The
> reason why I think it should be limited is that I'm pretty certain
> the problem is not in the root port itself. I have here a KBL based
> Thinkpad X1 Carbon 6th gen that can put the TBT controller into D3cold
> (it is connected to PCH root port) and it wakes up there just fine, so
> don't want to break that.
>
> Now, PCIe devices cannot go into D3cold all by themselves. They always
> need help from the platform side which is ACPI in this case. This is
> done by having the device to have _PR3 method that returns one or more
> power resources that the OS is supposed to turn off when the device is
> put into D3cold. All of that is implemented as form of ACPI methods that
> pretty much do the hardware specific things that are outside of PCIe
> spec to get the device into D3cold. At high level the _OFF() method
> causes the root port to broadcast a PME_Turn_Off message, which results in
> the link entering L2/L3 Ready; it then asserts PERST, configures WAKE (both
> can be GPIOs) and finally removes power (if the link goes into L3,
> otherwise it goes into L2).
>
> I think this is where the problem actually lies - the ASL methods that
> are used to put the device into D3cold and back. We know that in Windows
> this all works fine so unless Windows quirks the root port the same way
> there is another reason behind this.
>
> In case of Dell XPS 9560 (IIRC that's the machine you have) the
> corresponding power resource is called \_SB.PCI0.PEG0.PG00 and its
> _ON/_OFF methods end up calling PGON()/PGOF() accordingly. The methods
> itself do lots of things and it is hard to follow the disassembled
> ASL which does not have any comments but there are couple of things that
> stand out where we may go into a different path. One of them is this in
> the PGOF() method:
>
>If (((OSYS <= 0x07D9) || ((OSYS == 0x07DF) && (_REV == 0x05
>
> The ((OSYS == 0x07DF) && (_REV == 0x05)) checks specifically for Linux
> (see [1] and 18d78b64fddc ("ACPI / init: Make it possible to override
> _REV")) so it might be that Dell people tested this at some point in
> Linux as well. Added Mario in case he has any ideas.
>
> Previously I suggested you to try the ACPI method tracing to see what
> happens inside PGOF(). Did you have time to try it? It may provide more
> information about that is happening inside those methods and hopefully
> point us to the root cause.
>
> Also if you haven't tried already passing acpi_rev_override in the
> command line makes the _REV to return 5 so it should go into the "Linux"
> path in PGOF().

Oh, so does it look like we are trying to work around AML that tried
to work around some problematic behavior in Linux at one point?

> [1] 
> https://www.kernel.org/doc/html/latest/firmware-guide/acpi/osi.html#do-not-use-rev
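
As a small aside to the D3cold plumbing described in the quoted mail, a
driver can check whether the platform exposes _PR3 for its device at all,
i.e. whether the firmware-assisted power-off path applies; a minimal sketch
using existing ACPI helpers:

#include <linux/pci.h>
#include <linux/acpi.h>

/*
 * Return true if ACPI provides _PR3 for this PCI device, meaning D3cold
 * is reached by turning off ACPI power resources as described above.
 */
static bool pci_dev_has_pr3(struct pci_dev *pdev)
{
        acpi_handle handle = ACPI_HANDLE(&pdev->dev);

        return handle && acpi_has_method(handle, "_PR3");
}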

Re: [Nouveau] [PATCH 1/7] Revert "ACPI / OSI: Add OEM _OSI string to enable dGPU direct output"

2019-10-21 Thread Rafael J. Wysocki
On Mon, Oct 21, 2019 at 10:48 AM Karol Herbst  wrote:
>
> fyi: I decided to go for a different workaround to fix the runpm
> issues observed with nvidia gpus with nouveau in the "pci: prevent
> putting nvidia GPUs into lower device states on certain intel bridges"
> thread

OK, I've seen that.

> that's on the pci and pm mailing list. Maybe it makes sense to wait
> for that to land before actually removing the ACPI workarounds here?

Sounds reasonable.

> The workaround I had in this series didn't seem to be reliable enough,
> so I ditched that approached.

OK, please let me know when the _OSI string in question can be dropped.

> On Mon, Oct 21, 2019 at 10:14 AM Rafael J. Wysocki  wrote:
> >
> > On Mon, Oct 21, 2019 at 4:14 AM Alex Hung  wrote:
> > >
> > > We have done some tests on three of Intel + nVidia configuration
> > > systems with OEM _OSI strings removed - while some bugs are still
> > > observed, ex. one out of three has suspend/resume issues, no system
> > > crashes were observed - the biggest issue that worries us.
> > >
> > > The positive results give us confidence to ack the removal of the OEM
> > > _OSI strings. While our tests were not able to cover all possible I+N
> > > systems, we are sure we can fix issues along the way. If there aren't
> > > systems that cannot be fixed without these OEM _OSI strings, these
> > > strings should probably enable with DMI quirks (possible future
> > > patches) so they won't affect others.
> > >
> > > Acked-by: Alex Hung 
> >
> > OK, thanks!
> >
> > I can queue this up or if it's better to route it through the DRM
> > tree, please do that (and let me know).

Re: [Nouveau] [PATCH 1/7] Revert "ACPI / OSI: Add OEM _OSI string to enable dGPU direct output"

2019-10-21 Thread Rafael J. Wysocki
On Mon, Oct 21, 2019 at 4:14 AM Alex Hung  wrote:
>
> We have done some tests on three of Intel + nVidia configuration
> systems with OEM _OSI strings removed - while some bugs are still
> observed, ex. one out of three has suspend/resume issues, no system
> crashes were observed - the biggest issue that worries us.
>
> The positive results give us confidence to ack the removal of the OEM
> _OSI strings. While our tests were not able to cover all possible I+N
> systems, we are sure we can fix issues along the way. If there aren't
> systems that cannot be fixed without these OEM _OSI strings, these
> strings should probably enable with DMI quirks (possible future
> patches) so they won't affect others.
>
> Acked-by: Alex Hung 

OK, thanks!

I can queue this up or if it's better to route it through the DRM
tree, please do that (and let me know).

Re: [Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges

2019-10-03 Thread Rafael J. Wysocki
On Tuesday, October 1, 2019 12:00:50 PM CEST Karol Herbst wrote:
> On Tue, Oct 1, 2019 at 11:11 AM Mika Westerberg
>  wrote:
> >
> > On Tue, Oct 01, 2019 at 10:56:39AM +0200, Karol Herbst wrote:
> > > On Tue, Oct 1, 2019 at 10:47 AM Mika Westerberg
> > >  wrote:
> > > >
> > > > On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> > > > > On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> > > > >  wrote:
> > > > > >
> > > > > > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > > > > > still happens with your patch applied. The machine simply gets 
> > > > > > > shut down.
> > > > > > >
> > > > > > > dmesg can be found here:
> > > > > > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> > > > > >
> > > > > > Looking your dmesg:
> > > > > >
> > > > > > Sep 30 17:24:27 kernel: nouveau :01:00.0: DRM: DCB version 4.1
> > > > > > Sep 30 17:24:27 kernel: nouveau :01:00.0: DRM: MM: using COPY 
> > > > > > for buffer copies
> > > > > > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 
> > > > > > for :01:00.0 on minor 1
> > > > > >
> > > > > > I would assume it runtime suspends here. Then it wakes up because 
> > > > > > of PCI
> > > > > > access from userspace:
> > > > > >
> > > > > > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks 
> > > > > > suppressed
> > > > > >
> > > > > > and for some reason it does not get resumed properly. There are 
> > > > > > also few
> > > > > > warnings from ACPI that might be relevant:
> > > > > >
> > > > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument 
> > > > > > #4 type mismatch - Found [Buffer], ACPI requires [Package] 
> > > > > > (20190509/nsarguments-59)
> > > > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: 
> > > > > > Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] 
> > > > > > (20190509/nsarguments-59)
> > > > > >
> > > > >
> > > > > afaik this is the case for essentially every laptop out there.
> > > >
> > > > OK, so they are harmless?
> > > >
> > >
> > > yes
> > >
> > > > > > This seems to be Dell XPS 9560 which I think has been around some 
> > > > > > time
> > > > > > already so I wonder why we only see issues now. Has it ever worked 
> > > > > > for
> > > > > > you or maybe there is a regression that causes it to happen now?
> > > > >
> > > > > oh, it's broken since forever, we just tried to get more information
> > > > > from Nvidia if they know what this is all about, but we got nothing
> > > > > useful.
> > > > >
> > > > > We were also hoping to find a reliable fix or workaround we could have
> > > > > inside nouveau to fix that as I think nouveau is the only driver
> > > > > actually hit by this issue, but nothing turned out to be reliable
> > > > > enough.
> > > >
> > > > Can't you just block runtime PM from the nouveau driver until this is
> > > > understood better? That can be done by calling pm_runtime_forbid() (or
> > > > not calling pm_runtime_allow() in the driver). Or in case of PCI driver
> > > > you just don't decrease the reference count when probe() ends.
> > > >
> > >
> > > the thing is, it does work for a lot of laptops. We could only observe
> > > this on kaby lake and skylake ones. Even on Cannon Lakes it seems to
> > > work just fine.
> >
> > Can't you then limit it to those?
> >
> > I've experienced that Kabylake root ports can enter and exit in D3cold
> > just fine because we do that for Thunderbolt for example. But that
> > always requires help from ACPI. If the system is using non-standard ACPI
> > methods for example that may require some tricks in the driver side.
> >
> 
> yeah.. I am not quite sure what's actually the root cause. I was also
> trying to use the same PCI registers ACPI is using to trigger this
> issue on a normal desktop, no luck. Using the same registers does
> trigger the issue (hence the script).
> 
> The script is essentially just doing what ACPI does, just skipping a lot.
> 
> > > > I think that would be much better than blocking any devices behind
> > > > Kabylake PCIe root ports from entering D3 (I don't really think the
> > > > problem is in the root ports itself but there is something we are
> > > > missing when the NVIDIA GPU is put into D3cold or back from there).
> > >
> > > I highly doubt there is anything wrong with the GPU alone as we have
> > > too many indications which tell us otherwise.
> > >
> > > Anyway, at this point I don't know where to look further for what's
> > > actually wrong. And apparently it works on Windows, but I don't know
> > > why and I have no idea what Windows does on such systems to make it
> > > work reliably.
> >
> > By works you mean that Windows is able to put it into D3cold and back?
> > If that's the case it may be that there is some ACPI magic that the
> > Windows driver does and we of course are missing in Linux.
> 
> Afaik that's the case. We were 
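
A minimal sketch of the stop-gap Mika suggests in the quoted exchange
(keeping runtime PM blocked from the driver until the D3cold problem is
understood); this is illustrative only and not nouveau's actual probe path:

#include <linux/pci.h>
#include <linux/pm_runtime.h>

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        /* ... normal hardware setup would go here ... */

        /*
         * Keep the device in D0: either call pm_runtime_forbid() or simply
         * never drop the usage count (pm_runtime_put()/pm_runtime_allow())
         * at the end of probe().
         */
        pm_runtime_forbid(&pdev->dev);
        return 0;
}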

Re: [Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges

2019-10-02 Thread Rafael J. Wysocki
On Tue, Oct 1, 2019 at 9:34 PM Bjorn Helgaas  wrote:
>
> On Tue, Oct 01, 2019 at 06:21:28PM +0200, Karol Herbst wrote:
> > On Tue, Oct 1, 2019 at 3:27 PM Bjorn Helgaas  wrote:
> > > On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> > > > On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> > > >  wrote:
> > > > >
> > > > > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > > > > still happens with your patch applied. The machine simply gets shut 
> > > > > > down.
> > > > > >
> > > > > > dmesg can be found here:
> > > > > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> > > > >
> > > > > Looking your dmesg:
> > > > >
> > > > > Sep 30 17:24:27 kernel: nouveau :01:00.0: DRM: DCB version 4.1
> > > > > Sep 30 17:24:27 kernel: nouveau :01:00.0: DRM: MM: using COPY for 
> > > > > buffer copies
> > > > > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 
> > > > > :01:00.0 on minor 1
> > > > >
> > > > > I would assume it runtime suspends here. Then it wakes up because of 
> > > > > PCI
> > > > > access from userspace:
> > > > >
> > > > > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks 
> > > > > suppressed
> > > > >
> > > > > and for some reason it does not get resumed properly. There are also 
> > > > > few
> > > > > warnings from ACPI that might be relevant:
> > > > >
> > > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument 
> > > > > #4 type mismatch - Found [Buffer], ACPI requires [Package] 
> > > > > (20190509/nsarguments-59)
> > > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: 
> > > > > Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] 
> > > > > (20190509/nsarguments-59)
> > > >
> > > > afaik this is the case for essentially every laptop out there.
> > >
> > > I think we should look into this a little bit.
> > > acpi_ns_check_argument_types() checks the argument type and prints
> > > this message, but AFAICT it doesn't actually fix anything or prevent
> > > execution of the method, so I have no idea what happens when we
> > > actually execute the _DSM.
> >
> > I can assure you that this warning happens on every single laptop out
> > there with dual Nvidia graphics and it's essentially just a firmware
> > bug. And it never caused any issues on any of the older laptops (or
> > newest one for that matter).
>
> Rafael, do you know anything about this?

IIRC ACPICA will simply run the method with the assumption that the
AML in there will deal with the arguments properly anyway.

> If ACPI has some sort of workaround so it can execute the method correctly
> anyway, maybe we should remove or reword the warning?

I can agree that printing these warnings on a user system by default
is not very useful, at least as long as no visible functional issues
are present, but if there are such issues, it is good to know that
something fishy is going on.  For instance, while the method may
execute successfully, the result of that may not be as expected.

So maybe they should be debug level or similar.

> Or if this does prevent execution of the method, maybe we need to add
> a workaround since the problem is so prevalent in the field?

As par the above, no workarounds should be needed, but I'll let Bob
and Erik (CCed now) confirm or deny this.

A side note: please CC all discussions regarding general ACPI issues
to linux-acpi, so they can get the attention of all of the right
people (who may not subscribe linux-pci, for example).

Re: [Nouveau] [PATCH 1/7] Revert "ACPI / OSI: Add OEM _OSI string to enable dGPU direct output"

2019-09-05 Thread Rafael J. Wysocki
On Thursday, September 5, 2019 5:51:23 PM CEST Karol Herbst wrote:
> is there any update on the testing with my patches? On the hardware I
> had access to, those patches helped, but I can't know if they also helped
> on the hardware for which those workarounds were actually added.

Alex Hung and Mario need to answer this question I think.

> On Mon, Aug 19, 2019 at 11:52 AM Rafael J. Wysocki  wrote:
> >
> > On Thursday, August 15, 2019 12:47:35 AM CEST Dave Airlie wrote:
> > > On Thu, 15 Aug 2019 at 07:31, Karol Herbst  wrote:
> > > >
> > > > This reverts commit 28586a51eea666d5531bcaef2f68e4abbd87242c.
> > > >
> > > > The original commit message didn't even make sense. AMD _does_ support 
> > > > it and
> > > > it works with Nouveau as well.
> > > >
> > > > Also what was the issue being solved here? No references to any bugs 
> > > > and not
> > > > even explaining any issue at all isn't the way we do things.
> > > >
> > > > And even if it means a muxed design, then the fix is to make it work 
> > > > inside the
> > > > driver, not adding some hacky workaround through ACPI tricks.
> > > >
> > > > And what out of tree drivers do or do not support we don't care one bit 
> > > > anyway.
> > > >
> > >
> > > I think the reverts should be merged via Rafael's tree as the original
> > > patches went in via there, and we should get them in asap.
> > >
> > > Acked-by: Dave Airlie 
> >
> > The _OSI strings are to be dropped when all of the needed support is there 
> > in
> > drivers, so they should go away along with the requisite driver changes.
> >
> 
> that goes beside the point. firmware level workarounds for GPU driver
> issues were pushed without consulting with upstream GPU developers.
> That's something which shouldn't have happened in the first place. And
> yes, I am personally annoyed by the fact, that people know about
> issues, but instead of contacting the proper persons and working on a
> proper fix, we end up with stupid firmware level workarounds. I can't
> see why we ever would have wanted such workarounds in the first place.
> 
> And I would be much happier if the next time something like that comes
> up, that the drm mailing list will be contacted as well or somebody
> involved.
> 
> We could have also just disable the feature inside the driver (and
> probably we should have done that a long time ago, so that is
> essentially our fault, but still)
> 
> > I'm all for dropping them when that's the case, so please feel free to add 
> > ACKs
> > from me to the patches in question at that point.
> >
> > Cheers,
> > Rafael
> >
> >
> >
> 





Re: [Nouveau] [PATCH 1/7] Revert "ACPI / OSI: Add OEM _OSI string to enable dGPU direct output"

2019-08-19 Thread Rafael J. Wysocki
On Thursday, August 15, 2019 12:47:35 AM CEST Dave Airlie wrote:
> On Thu, 15 Aug 2019 at 07:31, Karol Herbst  wrote:
> >
> > This reverts commit 28586a51eea666d5531bcaef2f68e4abbd87242c.
> >
> > The original commit message didn't even make sense. AMD _does_ support it 
> > and
> > it works with Nouveau as well.
> >
> > Also what was the issue being solved here? No references to any bugs and not
> > even explaining any issue at all isn't the way we do things.
> >
> > And even if it means a muxed design, then the fix is to make it work inside 
> > the
> > driver, not adding some hacky workaround through ACPI tricks.
> >
> > And what out of tree drivers do or do not support we don't care one bit 
> > anyway.
> >
> 
> I think the reverts should be merged via Rafael's tree as the original
> patches went in via there, and we should get them in asap.
> 
> Acked-by: Dave Airlie 

The _OSI strings are to be dropped when all of the needed support is there in
drivers, so they should go away along with the requisite driver changes.

I'm all for dropping them when that's the case, so please feel free to add ACKs
from me to the patches in question at that point.

Cheers,
Rafael





Re: [Nouveau] [PATCH v3] PCI: Reprogram bridge prefetch registers on resume

2018-09-13 Thread Rafael J. Wysocki
On Thu, Sep 13, 2018 at 5:37 AM Daniel Drake  wrote:
>
> On 38+ Intel-based Asus products, the nvidia GPU becomes unusable
> after S3 suspend/resume. The affected products include multiple
> generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
> many errors such as:
>
> fifo: fault 00 [READ] at 00555000 engine 00 [GR] client 04
>   [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
> DRM: failed to idle channel 0 [DRM]
>
> Similarly, the nvidia proprietary driver also fails after resume
> (black screen, 100% CPU usage in Xorg process). We shipped a sample
> to Nvidia for diagnosis, and their response indicated that it's a
> problem with the parent PCI bridge (on the Intel SoC), not the GPU.
>
> Runtime suspend/resume works fine, only S3 suspend is affected.
>
> We found a workaround: on resume, rewrite the Intel PCI bridge
> 'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In
> the cases that I checked, this register has value 0 and we just have to
> rewrite that value.
>
> Linux already saves and restores PCI config space during suspend/resume,
> but this register was being skipped because upon resume, it already
> has value 0 (the correct, pre-suspend value).
>
> Intel appear to have previously acknowledged this behaviour and the
> requirement to rewrite this register.
> https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23
>
> Based on that, rewrite the prefetch register values even when that
> appears unnecessary.
>
> We have confirmed this solution on all the affected models we have
> in-hands (X542UQ, UX533FD, X530UN, V272UN).
>
> Additionally, this solves an issue where r8169 MSI-X interrupts were
> broken after S3 suspend/resume on Asus X441UAR. This issue was recently
> worked around in commit 7bb05b85bc2d ("r8169: don't use MSI-X on
> RTL8106e"). It also fixes the same issue on RTL6186evl/8111evl on an
> Aimfor-tech laptop that we had not yet patched. I suspect it will also
> fix the issue that was worked around in commit 7c53a722459c ("r8169:
> don't use MSI-X on RTL8168g").
>
> Thomas Martitz reports that this change also solves an issue where
> the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
> after S3 suspend/resume.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201069
> Signed-off-by: Daniel Drake 

Still

Reviewed-by: Rafael J. Wysocki 

> ---
>  drivers/pci/pci.c | 25 +
>  1 file changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 29ff9619b5fa..5d58220b6997 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1289,12 +1289,12 @@ int pci_save_state(struct pci_dev *dev)
>  EXPORT_SYMBOL(pci_save_state);
>
>  static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
> -u32 saved_val, int retry)
> +u32 saved_val, int retry, bool force)
>  {
> u32 val;
>
> pci_read_config_dword(pdev, offset, &val);
> -   if (val == saved_val)
> +   if (!force && val == saved_val)
> return;
>
> for (;;) {
> @@ -1313,25 +1313,34 @@ static void pci_restore_config_dword(struct pci_dev 
> *pdev, int offset,
>  }
>
>  static void pci_restore_config_space_range(struct pci_dev *pdev,
> -  int start, int end, int retry)
> +  int start, int end, int retry,
> +  bool force)
>  {
> int index;
>
> for (index = end; index >= start; index--)
> pci_restore_config_dword(pdev, 4 * index,
>  pdev->saved_config_space[index],
> -retry);
> +retry, force);
>  }
>
>  static void pci_restore_config_space(struct pci_dev *pdev)
>  {
> if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
> -   pci_restore_config_space_range(pdev, 10, 15, 0);
> +   pci_restore_config_space_range(pdev, 10, 15, 0, false);
> /* Restore BARs before the command register. */
> -   pci_restore_config_space_range(pdev, 4, 9, 10);
> -   pci_restore_config_space_range(pdev, 0, 3, 0);
> +   pci_restore_config_space_range(pdev, 4, 9, 10, false);
> +   pci_restore_config_space_range(pdev, 0, 3, 0, false);
> +   } else if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> +   pci_restore_config_space_range(pdev, 12, 15, 0, false);
> + 

Re: [Nouveau] [PATCH v2] PCI: Reprogram bridge prefetch registers on resume

2018-09-12 Thread Rafael J. Wysocki
dev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
> -   pci_restore_config_space_range(pdev, 10, 15, 0);
> +   pci_restore_config_space_range(pdev, 10, 15, 0, false);
> /* Restore BARs before the command register. */
> -   pci_restore_config_space_range(pdev, 4, 9, 10);
> -   pci_restore_config_space_range(pdev, 0, 3, 0);
> +   pci_restore_config_space_range(pdev, 4, 9, 10, false);
> +   pci_restore_config_space_range(pdev, 0, 3, 0, false);
> +   } else if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> +   pci_restore_config_space_range(pdev, 12, 15, 0, false);
> +   /* Force rewriting of prefetch registers to avoid
> +* S3 resume issues on Intel PCI bridges that occur when
> +* these registers are not explicitly written.
> +*/
> +   pci_restore_config_space_range(pdev, 9, 11, 0, true);
> +   pci_restore_config_space_range(pdev, 0, 8, 0, false);
> } else {
> -   pci_restore_config_space_range(pdev, 0, 15, 0);
> +   pci_restore_config_space_range(pdev, 0, 15, 0, false);
> }
>  }
>
> --

Passing the extra bool around is somewhat clumsy, but I haven't found
a cleaner way to do it, so

Reviewed-by: Rafael J. Wysocki 


Re: [Nouveau] [PATCH] PCI: Reprogram bridge prefetch registers on resume

2018-09-11 Thread Rafael J. Wysocki
On Tuesday, September 11, 2018 5:35:13 AM CEST Daniel Drake wrote:
> I have created https://bugzilla.kernel.org/show_bug.cgi?id=201069 to
> archive the research done so far.
> 
> On Fri, Sep 7, 2018 at 11:05 PM, Peter Wu  wrote:
> > Windows 10 unconditionally rewrites these registers (BAR, I/O Base +
> > Limit, Memory Base + Limit, etc. from top to bottom), see annotations:
> > https://www.spinics.net/lists/linux-pci/msg75856.html
> >
> > Linux has a generic "restore" operation that works backwards from the
> > end of the PCI config space to the beginning, see
> > pci_restore_config_space. Do you have a dmesg where you see the
> > "restoring config space at offset" messages?
> 
> Interesting, I had not spotted this code. The logs for the affected
> bridge on Asus X542UQ:
> 
>  :00:1c.0: restoring config space at offset 0x3c (was 0x100,
> writing 0x1001ff)
>  :00:1c.0: restoring config space at offset 0x24 (was 0x10001,
> writing 0xe1f1d001)
>  :00:1c.0: restoring config space at offset 0x20 (was 0x0, writing
> 0xef00ee00)
>  :00:1c.0: restoring config space at offset 0xc (was 0x81,
> writing 0x810010)
>  :00:1c.0: restoring config space at offset 0x4 (was 0x17,
> writing 0x100407)
> 
> So it didn't restore 0x28 (the magic register) and also register 0x2c
> (prefetch limit upper 32 bits) because their values were already 0 on
> resume.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23 has an
> assertion that Intel recognises this issue and calls for all those
> registers to be restored on resume.
> 
> I propose for Linux 4.19 we apply a minimally intrusive change that
> "forcibly" restores dword registers 0x24, 0x28 and 0x2c for PCI
> bridges (regardless of their current value), while retaining the
> existing restore order (high to low) over all registers and doing the
> read-then-maybe-write thing for all of the other regs (per current
> behaviour).

I agree here.  It would be good to cover this gap and possibly backport
the fix to "stable".

> Then we also consider swiftly applying a followup patch implementing a
> more thorough approach, giving it some time in linux-next before
> releasing it in Linux 4.20. I think
> perhaps more discussion is needed there, or at least some more input
> from Bjorn. Seems like we have 3 approaches to consider:
> 
> 1. Peter Wu suggests following what windows does. Windows appears to
> start with low registers and works its way upwards, which means it
> writes BAR addresses, I/O base, etc, before writing prefetch
> registers. It skips over read-only and write-to-clear registers and
> also only writes some of the registers at the very end - like the
> command register.
> 
> To be thorough here we would probably also have to study and copy what
> Windows does for non-bridge devices (not sure how many device classes
> we would want to study here?). Also it is a bit risky in that Bjorn
> has pointed out that Linux's earlier approach with a high level of
> similarity here (restore registers in ascending order, regardless of
> their current value) caused a laptop to hang on resume.
> 
> 
> 2. Bjorn suggested tweaking the existing code to always write the
> saved value even when the hardware has that same value. The
> read-maybe-write logic was introduced to avoid a laptop hang, but at
> that time we were restoring registers in ascending order, now we are
> descending it might be worth giving this another try.

I agree with Bjorn here.

> 3. Do nothing else beyond the minimal change that I propose for v4.19?
> Looking a bit more into git history this seems to be a sensitive and
> problematic area, more changes might mean more trouble. For example
> right now pci_restore_config_space() does:
> pci_restore_config_space_range(pdev, 10, 15, 0, 0);
> /* Restore BARs before the command register. */
> pci_restore_config_space_range(pdev, 4, 9, 10, 0);
> pci_restore_config_space_range(pdev, 0, 3, 0, 0);
> but pci_restore_config_space_range() already goes in descending order,
> so the above is already equivalent to the code in the "else" path that
> follows:
> pci_restore_config_space_range(pdev, 0, 15, 0, 0);
> 
> and if you look at the history behind this "mixup" there's stuff that
> goes back to 2012 like commit a6cb9ee7cabe ("PCI: Retry BARs
> restoration for Type 0 headers only") which was fixing up another
> problematic commit in this area, etc.

So let's first fix what's known broken and see how this goes IMO.

On top of that and later, try to make the changes suggested by Bjorn and see
how far we can get with that.

Having done that, revisit and see if there still are any problems to address
in this area.

Thanks,
Rafael
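
For orientation while reading the above, the three dwords being force-restored
are the bridge prefetch window registers; a small sketch that just reads them
back, with offsets taken from include/uapi/linux/pci_regs.h:

#include <linux/pci.h>

static void dump_bridge_prefetch_window(struct pci_dev *bridge)
{
        u32 base_limit, base_hi, limit_hi;

        pci_read_config_dword(bridge, PCI_PREF_MEMORY_BASE, &base_limit);  /* 0x24 */
        pci_read_config_dword(bridge, PCI_PREF_BASE_UPPER32, &base_hi);    /* 0x28 */
        pci_read_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, &limit_hi);  /* 0x2c */

        pci_info(bridge, "prefetch window: base/limit %#010x, upper %#010x/%#010x\n",
                 base_limit, base_hi, limit_hi);
}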



Re: [Nouveau] [PATCH 1/5] drm/nouveau: Prevent RPM callback recursion in suspend/resume paths

2018-07-18 Thread Rafael J. Wysocki
On Wed, Jul 18, 2018 at 10:11 PM, Lyude Paul  wrote:
> On Wed, 2018-07-18 at 10:36 +0200, Lukas Wunner wrote:
>> On Wed, Jul 18, 2018 at 10:25:05AM +0200, Lukas Wunner wrote:
>> > The GPU contains an i2c subdevice for each connector with DDC lines.
>> > I believe those are modelled as children of the GPU's PCI device as
>> > they're accessed via mmio of the PCI device.
>> >
>> > The problem here is that when the GPU's PCI device runtime suspends,
>> > its i2c child device needs to be runtime active to suspend the MST
>> > topology.  Catch-22.
>> >
>> > I don't know whether or not it's necessary to suspend the MST topology.
>> > I'm not an expert on DisplayPort MultiStream transport.
>> >
>> > BTW Lyude, in patch 4 and 5 of this series, you're runtime resuming
>> > pad->i2c->subdev.device->dev.  Is this the PCI device or is it the i2c
>> > device?  I'm always confused by nouveau's structs.  In nvkm_i2c_bus_ctor()
>> > I can see that the device you're runtime resuming is the parent of the
>> > i2c_adapter:
>> >
>> > struct nvkm_device *device = pad->i2c->subdev.device;
>> > [...]
>> > bus->i2c.dev.parent = device->dev;
>> >
>> > If the i2c_adapter is a child of the PCI device, it's sufficient
>> > to runtime resume the i2c_adapter, i.e. bus->i2c.dev, and this will
>> > implicitly runtime resume its parent.
>>
>> Actually, having written all this I just remembered that we have this
>> in the documentation:
>>
>> 8. "No-Callback" Devices
>>
>> Some "devices" are only logical sub-devices of their parent and cannot
>> be
>> power-managed on their own. [...]
>>
>> Subsystems can tell the PM core about these devices by calling
>> pm_runtime_no_callbacks().
>>
>> So it might actually be sufficient to just call pm_runtime_no_callbacks()
>
> I would have hoped so, but unfortunately it seems that
> pm_runtime_no_callbacks() is already called by default for i2c adapters in
> i2c_register_adapter(). Unfortunately this really can't fix the problem
> though, because it will still try to runtime resume the parent device of the
> i2c adapter, which still leads to deadlocking in the runtime suspend/resume
> path.

Well, there has to be a way to suspend all of that without recursion or
similar.

If the adapter has no callbacks, then how is it possible for those
callbacks to invoke any runtime PM helpers for any other devices?

> Additionally; I did play around with ignore_children, but unfortunately this
> isn't good enough either as it just means that our i2c devices won't wake the
> GPU up on access.

So on the one hand you want them to stay active over a suspend of the
parent and on the other hand you want  the parent to resume before
them.  Are these requirements really consistent with each other?

> I'm pretty stumped here on trying to figure out any clean way to handle this
> in the PM core if recursive resume calls are off the table. The only possible
> solution I could see to this is if we could disable execution of runtime
> callbacks in the context of a certain task (while all other tasks have to
> honor the runtime PM callbacks), do what we need to do in suspend, then re-
> enable them
>> for the i2c devices...

This sounds completely broken to me, sorry.


Re: [Nouveau] [PATCH 1/5] drm/nouveau: Prevent RPM callback recursion in suspend/resume paths

2018-07-18 Thread Rafael J. Wysocki
On Wed, Jul 18, 2018 at 10:25 AM, Lukas Wunner  wrote:
> On Wed, Jul 18, 2018 at 09:38:41AM +0200, Rafael J. Wysocki wrote:
>> On Tue, Jul 17, 2018 at 8:20 PM, Lukas Wunner  wrote:
>> > Okay, the PCI device is suspending and the nvkm_i2c_aux_acquire()
>> > wants it in resumed state, so is waiting forever for the device to
>> > runtime suspend in order to resume it again immediately afterwards.
>> >
>> > The deadlock in the stack trace you've posted could be resolved using
>> > the technique I used in d61a5c106351 by adding the following to
>> > include/linux/pm_runtime.h:
>> >
>> > static inline bool pm_runtime_status_suspending(struct device *dev)
>> > {
>> > return dev->power.runtime_status == RPM_SUSPENDING;
>> > }
>> >
>> > static inline bool is_pm_work(struct device *dev)
>> > {
>> > struct work_struct *work = current_work();
>> >
>> > return work && work->func == dev->power.work;
>> > }
>> >
>> > Then adding this to nvkm_i2c_aux_acquire():
>> >
>> > struct device *dev = pad->i2c->subdev.device->dev;
>> >
>> > if (!(is_pm_work(dev) && pm_runtime_status_suspending(dev))) {
>> > ret = pm_runtime_get_sync(dev);
>> > if (ret < 0 && ret != -EACCES)
>> > return ret;
>> > }
> [snip]
>>
>> For the record, I don't quite like this approach as it seems to be
>> working around a broken dependency graph.
>>
>> If you need to resume device A from within the runtime resume callback
>> of device B, then clearly B depends on A and there should be a link
>> between them.
>>
>> That said, I do realize that it may be the path of least resistance,
>> but then I wonder if we can do better than this.
>
> The GPU contains an i2c subdevice for each connector with DDC lines.
> I believe those are modelled as children of the GPU's PCI device as
> they're accessed via mmio of the PCI device.
>
> The problem here is that when the GPU's PCI device runtime suspends,
> its i2c child device needs to be runtime active to suspend the MST
> topology.  Catch-22.

I see.

This sounds like a case for the ignore_children flag, maybe in a
slightly modified form, that will allow the parent to be suspended
regardless of the state of the children.

I wonder what happens to the I2C subdevices when the PCI device goes
into D3.  They are not accessible through MMIO any more at that point,
so how can they be suspended?  Or do they need to be suspended at all?

> I don't know whether or not it's necessary to suspend the MST topology.
> I'm not an expert on DisplayPort MultiStream transport.

Me neither. :-)


Re: [Nouveau] [PATCH 1/5] drm/nouveau: Prevent RPM callback recursion in suspend/resume paths

2018-07-18 Thread Rafael J. Wysocki
On Tue, Jul 17, 2018 at 8:34 PM, Lyude Paul  wrote:
> On Tue, 2018-07-17 at 20:32 +0200, Lukas Wunner wrote:
>> On Tue, Jul 17, 2018 at 02:24:31PM -0400, Lyude Paul wrote:
>> > On Tue, 2018-07-17 at 20:20 +0200, Lukas Wunner wrote:
>> > > Okay, the PCI device is suspending and the nvkm_i2c_aux_acquire()
>> > > wants it in resumed state, so is waiting forever for the device to
>> > > runtime suspend in order to resume it again immediately afterwards.
>> > >
>> > > The deadlock in the stack trace you've posted could be resolved using
>> > > the technique I used in d61a5c106351 by adding the following to
>> > > include/linux/pm_runtime.h:
>> > >
>> > > static inline bool pm_runtime_status_suspending(struct device *dev)
>> > > {
>> > >   return dev->power.runtime_status == RPM_SUSPENDING;
>> > > }
>> > >
>> > > static inline bool is_pm_work(struct device *dev)
>> > > {
>> > >   struct work_struct *work = current_work();
>> > >
>> > >   return work && work->func == dev->power.work;
>> > > }
>> > >
>> > > Then adding this to nvkm_i2c_aux_acquire():
>> > >
>> > >   struct device *dev = pad->i2c->subdev.device->dev;
>> > >
>> > >   if (!(is_pm_work(dev) && pm_runtime_status_suspending(dev))) {
>> > >   ret = pm_runtime_get_sync(dev);
>> > >   if (ret < 0 && ret != -EACCES)
>> > >   return ret;
>> > >   }
>> > >
>> > > But here's the catch:  This only works for an *async* runtime suspend.
>> > > It doesn't work for pm_runtime_put_sync(), pm_runtime_suspend() etc,
>> > because then the runtime suspend is executed in the context of the caller,
>> > not in the context of dev->power.work.
>> > >
>> > > So it's not a full solution, but hopefully something that gets you
>> > > going.  I'm not really familiar with the code paths leading to
>> > > nvkm_i2c_aux_acquire() to come up with a full solution off the top
>> > > of my head I'm afraid.
>> >
>> > OK - I was considering doing something similar to that commit beforehand,
>> > but I wasn't sure if I was going to just be hacking around an actual issue.
>> > That doesn't seem to be the case. This is very helpful and hopefully I
>> > should be able to figure something out from this, thanks!
>>
>> In some cases, the function acquiring the runtime PM ref is only called
>> from a couple of places and then it would be feasible and appropriate
>> to add a bool parameter to the function telling it to acquire the ref
>> or not.  So the function is told using a parameter which context it's
>> running in:  In the runtime_suspend code path or some other code path.
>>
>> The technique to use current_work() is an alternative approach to figure
>> out the context if passing in an additional parameter is not feasible
>> for some reason.  That was the case with d61a5c106351.  That approach
>> only works for work items though.
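
(For concreteness, a minimal sketch of the bool-parameter variant; every name
below is hypothetical, the point is only that the runtime_suspend path passes
false because the device is already being kept accessible for it:)

        static int example_aux_acquire(struct example_aux *aux, bool wake_gpu)
        {
                struct device *dev = aux->dev;  /* the GPU's PCI device */
                int ret;

                if (wake_gpu) {
                        ret = pm_runtime_get_sync(dev);
                        if (ret < 0 && ret != -EACCES)
                                return ret;
                }

                return example_do_acquire(aux); /* hypothetical helper */
        }

The matching put in the release path would be made conditional in the same way.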
>
> Something I'm curious about. This isn't the first time I've hit a situation
> like this (see: the improper disable_depth fix I added into amdgpu that I now
> need to go and fix), which makes me wonder: is there actually any reason
> Linux's runtime PM core doesn't just turn get/puts() in the context of s/r
> callbacks into no-ops by default?

Because it's hard to detect reliably enough and because hiding issues
is a bad idea in general.

As I've just said in the message to Lukas, the fact that you need to
resume another device from within your resume callback indicates that
you're hiding your dependency graph from the core.

Thanks,
Rafael


Re: [Nouveau] [PATCH 1/5] drm/nouveau: Prevent RPM callback recursion in suspend/resume paths

2018-07-18 Thread Rafael J. Wysocki
On Tue, Jul 17, 2018 at 8:20 PM, Lukas Wunner  wrote:
> On Tue, Jul 17, 2018 at 12:53:11PM -0400, Lyude Paul wrote:
>> On Tue, 2018-07-17 at 09:16 +0200, Lukas Wunner wrote:
>> > On Mon, Jul 16, 2018 at 07:59:25PM -0400, Lyude Paul wrote:
>> > > In order to fix all of the spots that need to have runtime PM get/puts()
>> > > added, we need to ensure that it's possible for us to call
>> > > pm_runtime_get/put() in any context, regardless of how deep, since
>> > > almost all of the spots that are currently missing refs can potentially
>> > > get called in the runtime suspend/resume path. Otherwise, we'll try to
>> > > resume the GPU as we're trying to resume the GPU (and vice-versa) and
>> > > cause the kernel to deadlock.
>> > >
>> > > With this, it should be safe to call the pm runtime functions in any
>> > > context in nouveau with one condition: any point in the driver that
>> > > calls pm_runtime_get*() cannot hold any locks owned by nouveau that
>> > > would be acquired anywhere inside nouveau_pmops_runtime_resume().
>> > > This includes modesetting locks, i2c bus locks, etc.
>> >
>> > [snip]
>> > > --- a/drivers/gpu/drm/nouveau/nouveau_drm.c
>> > > +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
>> > > @@ -835,6 +835,8 @@ nouveau_pmops_runtime_suspend(struct device *dev)
>> > >   return -EBUSY;
>> > >   }
>> > >
>> > > + dev->power.disable_depth++;
>> > > +
>> >
>> > Anyway, if I understand the commit message correctly, you're hitting a
>> > pm_runtime_get_sync() in a code path that itself is called during a
>> > pm_runtime_get_sync().  Could you include stack traces in the commit
>> > message?  My gut feeling is that this patch masks a deeper issue,
>> > e.g. if the runtime_resume code path does in fact directly poll outputs,
>> > that would seem wrong.  Runtime resume should merely make the card
>> > accessible, i.e. reinstate power if necessary, put into PCI_D0,
>> > restore registers, etc.  Output polling should be scheduled
>> > asynchronously.
>>
>> So: the reason that patch was added was mainly for the patches later in the
>> series that add guards around the i2c bus and aux bus, since both of those
>> require that the device be awake for it to work. Currently, the spot where it
>> would recurse is:
>
> Okay, the PCI device is suspending and the nvkm_i2c_aux_acquire()
> wants it in resumed state, so is waiting forever for the device to
> runtime suspend in order to resume it again immediately afterwards.
>
> The deadlock in the stack trace you've posted could be resolved using
> the technique I used in d61a5c106351 by adding the following to
> include/linux/pm_runtime.h:
>
> static inline bool pm_runtime_status_suspending(struct device *dev)
> {
> return dev->power.runtime_status == RPM_SUSPENDING;
> }
>
> static inline bool is_pm_work(struct device *dev)
> {
> struct work_struct *work = current_work();
>
> return work && work->func == dev->power.work;
> }
>
> Then adding this to nvkm_i2c_aux_acquire():
>
> struct device *dev = pad->i2c->subdev.device->dev;
>
> if (!(is_pm_work(dev) && pm_runtime_status_suspending(dev))) {
> ret = pm_runtime_get_sync(dev);
> if (ret < 0 && ret != -EACCES)
> return ret;
> }
>
> But here's the catch:  This only works for an *async* runtime suspend.
> It doesn't work for pm_runtime_put_sync(), pm_runtime_suspend() etc,
> because then the runtime suspend is executed in the context of the caller,
> not in the context of dev->power.work.
>
> So it's not a full solution, but hopefully something that gets you
> going.  I'm not really familiar with the code paths leading to
> nvkm_i2c_aux_acquire() to come up with a full solution off the top
> of my head I'm afraid.
>
> Note, it's not sufficient to just check pm_runtime_status_suspending(dev)
> because if the runtime_suspend is carried out concurrently by something
> else, this will return true but it's not guaranteed that the device is
> actually kept awake until the i2c communication has been fully performed.

For the record, I don't quite like this approach as it seems to be
working around a broken dependency graph.

If you need to resume device A from within the runtime resume callback
of device B, then clearly B depends on A and there should be a link
between them.
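
(For illustration, expressing such a dependency explicitly looks in general like
the sketch below; consumer and supplier stand in for the two struct device
pointers, and the flag choice is illustrative rather than a recommendation for
this particular case:)

        /* Tell the PM core that "consumer" needs "supplier": the core will
         * then runtime resume the supplier before the consumer and keep it
         * active for as long as the consumer is active.
         */
        struct device_link *link;

        link = device_link_add(consumer, supplier,
                               DL_FLAG_PM_RUNTIME | DL_FLAG_RPM_ACTIVE);
        if (!link)
                dev_err(consumer, "failed to add device link\n");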

That said, I do realize that it may be the path of least resistance,
but then I wonder if we can do better than this.

Thanks,
Rafael


Re: [Nouveau] [PATCH 1/5] drm/nouveau: Prevent RPM callback recursion in suspend/resume paths

2018-07-17 Thread Rafael J. Wysocki
On Tue, Jul 17, 2018 at 9:16 AM, Lukas Wunner  wrote:
> [cc += linux-pm]
>
> Hi Lyude,
>
> First of all, thanks a lot for looking into this.
>
> On Mon, Jul 16, 2018 at 07:59:25PM -0400, Lyude Paul wrote:
>> In order to fix all of the spots that need to have runtime PM get/puts()
>> added, we need to ensure that it's possible for us to call
>> pm_runtime_get/put() in any context, regardless of how deep, since
>> almost all of the spots that are currently missing refs can potentially
>> get called in the runtime suspend/resume path. Otherwise, we'll try to
>> resume the GPU as we're trying to resume the GPU (and vice-versa) and
>> cause the kernel to deadlock.
>>
>> With this, it should be safe to call the pm runtime functions in any
>> context in nouveau with one condition: any point in the driver that
>> calls pm_runtime_get*() cannot hold any locks owned by nouveau that
>> would be acquired anywhere inside nouveau_pmops_runtime_resume().
>> This includes modesetting locks, i2c bus locks, etc.
> [snip]
>> --- a/drivers/gpu/drm/nouveau/nouveau_drm.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
>> @@ -835,6 +835,8 @@ nouveau_pmops_runtime_suspend(struct device *dev)
>>   return -EBUSY;
>>   }
>>
>> + dev->power.disable_depth++;

This is effectively equivalent to __pm_runtime_disable(dev, false)
except for the locking (which is necessary).
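
(For context, the PM core only ever updates that counter with the device's
power lock held; a bare sketch of the locked equivalent of the open-coded bump,
which is still not something a driver should be doing:)

        spin_lock_irq(&dev->power.lock);
        dev->power.disable_depth++;
        spin_unlock_irq(&dev->power.lock);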

>> +
>
> I'm not sure if that variable is actually private to the PM core.
> Grepping through the tree I only find a single occurrence where it's
> accessed outside the PM core and that's in amdgpu.  So this looks
> a little fishy TBH.  It may make sense to cc such patches to linux-pm
> to get Rafael & other folks involved with the PM core to comment.

You are right, power.disable_depth is internal to the PM core.
Accessing it (and updating it in particular) directly from drivers is
not a good idea.

> Also, the disable_depth variable only exists if the kernel was
> compiled with CONFIG_PM enabled, but I can't find a "depends on PM"
> or something like that in nouveau's Kconfig.  Actually, if PM is
> not selected, all the nouveau_pmops_*() functions should be #ifdef'ed
> away, but oddly there's no #ifdef CONFIG_PM anywhere in nouveau_drm.c.
>
> Anyway, if I understand the commit message correctly, you're hitting a
> pm_runtime_get_sync() in a code path that itself is called during a
> pm_runtime_get_sync().  Could you include stack traces in the commit
> message?  My gut feeling is that this patch masks a deeper issue,
> e.g. if the runtime_resume code path does in fact directly poll outputs,
> that would seem wrong.  Runtime resume should merely make the card
> accessible, i.e. reinstate power if necessary, put into PCI_D0,
> restore registers, etc.  Output polling should be scheduled
> asynchronously.

Right.
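
(A minimal sketch of that split, with all names hypothetical and the
hardware-resume helper left undefined:)

        static int example_runtime_resume(struct device *dev)
        {
                struct drm_device *drm_dev = dev_get_drvdata(dev);
                int ret;

                /* Only make the card accessible again: reinstate power, put
                 * it into PCI_D0, restore registers, and so on.
                 */
                ret = example_hw_resume(drm_dev);
                if (ret)
                        return ret;

                /* Probe outputs later from the poll worker, not here. */
                drm_kms_helper_poll_enable(drm_dev);
                return 0;
        }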

Thanks,
Rafael


Re: [Nouveau] [PATCH 1/7] PCI: Restore BARs on runtime resume despite being unbound

2018-02-21 Thread Rafael J. Wysocki
On Wednesday, February 21, 2018 10:57:14 AM CET Rafael J. Wysocki wrote:
> On Tue, Feb 20, 2018 at 10:29 PM, Bjorn Helgaas <helg...@kernel.org> wrote:
> > On Sun, Feb 18, 2018 at 09:38:32AM +0100, Lukas Wunner wrote:
> >> PCI devices not bound to a driver are supposed to stay in D0 during
> >> runtime suspend.
> >
> > Doesn't "runtime suspend" mean an individual device is suspended while
> > the rest of the system remains active?
> >
> > If so, maybe it would be more direct to say "PCI devices not bound to
> > a driver cannot be runtime suspended"?
> >
> > (It's a separate question not relevant to this patch, but I'm not
> > convinced that "PCI devices without a driver cannot be suspended"
> > should be accepted as a rule.  If it is a rule, we should be able to
> > deduce it from the specs.)
> >
> >> But they may have a parent which is bound and can be
> >> transitioned to D3cold at runtime.  Once the parent goes to D3cold, the
> >> unbound child may go to D3cold as well.  When the child comes out of
> >> D3cold, its BARs are uninitialized and thus inaccessible when a driver
> >> tries to probe.
> >>
> >> One example is recent hybrid graphics laptops which cut power to the
> >> discrete GPU when the root port above it goes to ACPI power state D3.
> >> Users may provoke this by unbinding the GPU driver and allowing runtime
> >> PM on the GPU via sysfs:  The PM core will then treat the GPU as
> >> "suspended", which in turn allows the root port to runtime suspend,
> >> causing the power resources listed in its _PR3 object to be powered off.
> >> The GPU's BARs will be uninitialized when a driver later probes it.
> >>
> >> Another example is hybrid graphics laptops where the GPU itself (rather
> >> than the root port) is capable of runtime suspending to D3cold.  If the
> >> GPU's integrated HDA controller is not bound and the GPU's driver
> >> decides to runtime suspend to D3cold, the HDA controller's BARs will be
> >> uninitialized when a driver later probes it.
> >>
> >> Fix by restoring the BARs on runtime resume if the device is not bound.
> >> This is sufficient to fix the above-mentioned use cases.  Other use
> >> cases might require a full-blown pci_save_state() / pci_restore_state()
> >> or execution of fixups.  We can add that once use cases materialize,
> >> let's not inflate the code unnecessarily.
> >
> > Why would we not do a full-blown restore?  With this patch, I think
> > some configuration done during enumeration, e.g., ASPM and MPS, will
> > be lost.  "lspci -vv" of the HDA before and after the suspend may show
> > different things, which seems counter-intuitive.
> 
> Right.
> 
> > I wouldn't think of a full-blown restore as "inflating the code"; I
> > would think of it as "having fewer special cases", i.e., we always use
> > the same restore path instead of having the main one plus a special
> > one for unbound devices.
> 
> That is a fair point, but if you look at pci_pm_runtime_suspend(), it
> doesn't actually save anything for devices without drivers, so we'd
> just restore the original initial state for them every time.
> 
> If we are to restore the entire state on runtime resume,
> pci_pm_runtime_suspend() should be changed to call pci_save_state()
> before returning 0 in the !dev->driver case AFAICS.
> 
> >> Cc: Bjorn Helgaas <bhelg...@google.com>
> >> Cc: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
> >> Signed-off-by: Lukas Wunner <lu...@wunner.de>
> >> ---
> >>  drivers/pci/pci-driver.c | 8 ++--
> >>  drivers/pci/pci.c| 2 +-
> >>  drivers/pci/pci.h| 1 +
> >>  3 files changed, 8 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> >> index 3bed6beda051..51b11cbd48f6 100644
> >> --- a/drivers/pci/pci-driver.c
> >> +++ b/drivers/pci/pci-driver.c
> >> @@ -1277,10 +1277,14 @@ static int pci_pm_runtime_resume(struct device *dev)
> >>
> >>   /*
> >>* If pci_dev->driver is not set (unbound), the device should
> >> -  * always remain in D0 regardless of the runtime PM status
> >> +  * always remain in D0 regardless of the runtime PM status.
> >> +  * But if its parent can go to D3cold, this device may have
> >> +  * been in D3cold as well and require restoration of its BARs.

Re: [Nouveau] [PATCH 1/7] PCI: Restore BARs on runtime resume despite being unbound

2018-02-21 Thread Rafael J. Wysocki
On Tue, Feb 20, 2018 at 10:29 PM, Bjorn Helgaas <helg...@kernel.org> wrote:
> On Sun, Feb 18, 2018 at 09:38:32AM +0100, Lukas Wunner wrote:
>> PCI devices not bound to a driver are supposed to stay in D0 during
>> runtime suspend.
>
> Doesn't "runtime suspend" mean an individual device is suspended while
> the rest of the system remains active?
>
> If so, maybe it would be more direct to say "PCI devices not bound to
> a driver cannot be runtime suspended"?
>
> (It's a separate question not relevant to this patch, but I'm not
> convinced that "PCI devices without a driver cannot be suspended"
> should be accepted as a rule.  If it is a rule, we should be able to
> deduce it from the specs.)
>
>> But they may have a parent which is bound and can be
>> transitioned to D3cold at runtime.  Once the parent goes to D3cold, the
>> unbound child may go to D3cold as well.  When the child comes out of
>> D3cold, its BARs are uninitialized and thus inaccessible when a driver
>> tries to probe.
>>
>> One example is recent hybrid graphics laptops which cut power to the
>> discrete GPU when the root port above it goes to ACPI power state D3.
>> Users may provoke this by unbinding the GPU driver and allowing runtime
>> PM on the GPU via sysfs:  The PM core will then treat the GPU as
>> "suspended", which in turn allows the root port to runtime suspend,
>> causing the power resources listed in its _PR3 object to be powered off.
>> The GPU's BARs will be uninitialized when a driver later probes it.
>>
>> Another example is hybrid graphics laptops where the GPU itself (rather
>> than the root port) is capable of runtime suspending to D3cold.  If the
>> GPU's integrated HDA controller is not bound and the GPU's driver
>> decides to runtime suspend to D3cold, the HDA controller's BARs will be
>> uninitialized when a driver later probes it.
>>
>> Fix by restoring the BARs on runtime resume if the device is not bound.
>> This is sufficient to fix the above-mentioned use cases.  Other use
>> cases might require a full-blown pci_save_state() / pci_restore_state()
>> or execution of fixups.  We can add that once use cases materialize,
>> let's not inflate the code unnecessarily.
>
> Why would we not do a full-blown restore?  With this patch, I think
> some configuration done during enumeration, e.g., ASPM and MPS, will
> be lost.  "lspci -vv" of the HDA before and after the suspend may show
> different things, which seems counter-intuitive.

Right.

> I wouldn't think of a full-blown restore as "inflating the code"; I
> would think of it as "having fewer special cases", i.e., we always use
> the same restore path instead of having the main one plus a special
> one for unbound devices.

That is a fair point, but if you look at pci_pm_runtime_suspend(), it
doesn't actually save anything for devices without drivers, so we'd
just restore the original initial state for them every time.

If we are to restore the entire state on runtime resume,
pci_pm_runtime_suspend() should be changed to call pci_save_state()
before returning 0 in the !dev->driver case AFAICS.
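
For completeness, a rough sketch of what that change could look like inside
pci_pm_runtime_suspend() (not a tested patch):

        /* Driverless devices stay in D0, but saving their state here gives a
         * later full restore on runtime resume something accurate to work
         * with.
         */
        if (!pci_dev->driver) {
                pci_save_state(pci_dev);
                return 0;
        }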

>> Cc: Bjorn Helgaas <bhelg...@google.com>
>> Cc: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
>> Signed-off-by: Lukas Wunner <lu...@wunner.de>
>> ---
>>  drivers/pci/pci-driver.c | 8 ++--
>>  drivers/pci/pci.c| 2 +-
>>  drivers/pci/pci.h| 1 +
>>  3 files changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
>> index 3bed6beda051..51b11cbd48f6 100644
>> --- a/drivers/pci/pci-driver.c
>> +++ b/drivers/pci/pci-driver.c
>> @@ -1277,10 +1277,14 @@ static int pci_pm_runtime_resume(struct device *dev)
>>
>>   /*
>>* If pci_dev->driver is not set (unbound), the device should
>> -  * always remain in D0 regardless of the runtime PM status
>> +  * always remain in D0 regardless of the runtime PM status.
>> +  * But if its parent can go to D3cold, this device may have
>> +  * been in D3cold as well and require restoration of its BARs.
>>*/
>> - if (!pci_dev->driver)
>> + if (!pci_dev->driver) {
>> + pci_restore_bars(pci_dev);
>
> If we do decide not to do a full-blown restore, how do we decide
> whether to use pci_restore_bars() or pci_restore_config_space()?
>
> I'm not sure why we have both.

Me neither.

> The pci_restore_bars() path looks a
> little smarter in that it is more careful when updating 64-bit BARs
> that can't be updated atomically.
>

Re: [Nouveau] [PATCH 1/7] PCI: Restore BARs on runtime resume despite being unbound

2018-02-19 Thread Rafael J. Wysocki
On Sun, Feb 18, 2018 at 9:38 AM, Lukas Wunner <lu...@wunner.de> wrote:
> PCI devices not bound to a driver are supposed to stay in D0 during
> runtime suspend.  But they may have a parent which is bound and can be
> transitioned to D3cold at runtime.  Once the parent goes to D3cold, the
> unbound child may go to D3cold as well.  When the child comes out of
> D3cold, its BARs are uninitialized and thus inaccessible when a driver
> tries to probe.
>
> One example is recent hybrid graphics laptops which cut power to the
> discrete GPU when the root port above it goes to ACPI power state D3.
> Users may provoke this by unbinding the GPU driver and allowing runtime
> PM on the GPU via sysfs:  The PM core will then treat the GPU as
> "suspended", which in turn allows the root port to runtime suspend,
> causing the power resources listed in its _PR3 object to be powered off.
> The GPU's BARs will be uninitialized when a driver later probes it.
>
> Another example is hybrid graphics laptops where the GPU itself (rather
> than the root port) is capable of runtime suspending to D3cold.  If the
> GPU's integrated HDA controller is not bound and the GPU's driver
> decides to runtime suspend to D3cold, the HDA controller's BARs will be
> uninitialized when a driver later probes it.
>
> Fix by restoring the BARs on runtime resume if the device is not bound.
> This is sufficient to fix the above-mentioned use cases.  Other use
> cases might require a full-blown pci_save_state() / pci_restore_state()
> or execution of fixups.  We can add that once use cases materialize,
> let's not inflate the code unnecessarily.
>
> Cc: Bjorn Helgaas <bhelg...@google.com>
> Cc: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
> Signed-off-by: Lukas Wunner <lu...@wunner.de>

Reviewed-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>

> ---
>  drivers/pci/pci-driver.c | 8 ++--
>  drivers/pci/pci.c| 2 +-
>  drivers/pci/pci.h| 1 +
>  3 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 3bed6beda051..51b11cbd48f6 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1277,10 +1277,14 @@ static int pci_pm_runtime_resume(struct device *dev)
>
> /*
>  * If pci_dev->driver is not set (unbound), the device should
> -* always remain in D0 regardless of the runtime PM status
> +* always remain in D0 regardless of the runtime PM status.
> +* But if its parent can go to D3cold, this device may have
> +* been in D3cold as well and require restoration of its BARs.
>  */
> -   if (!pci_dev->driver)
> +   if (!pci_dev->driver) {
> +   pci_restore_bars(pci_dev);
> return 0;
> +   }
>
> if (!pm || !pm->runtime_resume)
> return -ENOSYS;
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index f6a4dd10d9b0..f694650235f2 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -563,7 +563,7 @@ int pci_wait_for_pending(struct pci_dev *dev, int pos, u16 mask)
>   * Restore the BAR values for a given device, so as to make it
>   * accessible by its driver.
>   */
> -static void pci_restore_bars(struct pci_dev *dev)
> +void pci_restore_bars(struct pci_dev *dev)
>  {
> int i;
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index fcd81911b127..29dc15bbe3bf 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -83,6 +83,7 @@ void pci_allocate_cap_save_buffers(struct pci_dev *dev);
>  void pci_free_cap_save_buffers(struct pci_dev *dev);
>  bool pci_bridge_d3_possible(struct pci_dev *dev);
>  void pci_bridge_d3_update(struct pci_dev *dev);
> +void pci_restore_bars(struct pci_dev *dev);
>
>  static inline void pci_wakeup_event(struct pci_dev *dev)
>  {
> --
> 2.15.1
>


Re: [Nouveau] [PATCH 00/35] treewide trivial patches converting pr_warning to pr_warn

2017-02-17 Thread Rafael J. Wysocki
On Fri, Feb 17, 2017 at 8:11 AM, Joe Perches  wrote:
> There are ~4300 uses of pr_warn and ~250 uses of the older
> pr_warning in the kernel source tree.
>
> Make the use of pr_warn consistent across all kernel files.
>
> This excludes all files in tools/ as there is a separate
> define pr_warning for that directory tree and pr_warn is
> not used in tools/.
>
> Done with 'sed s/\bpr_warning\b/pr_warn/' and some emacsing.

Sorry for asking if this has been asked already.

Wouldn't it be slightly less intrusive to simply redefine
pr_warning() as a synonym for pr_warn()?
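
(I.e. something along these lines in include/linux/printk.h, assuming nothing
else still defines pr_warning:)

/* Keep the old name as a plain alias instead of touching ~250 call sites. */
#define pr_warning pr_warn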

Thanks,
Rafael


Re: [Nouveau] [PATCH] acpi: video: Move ACPI_VIDEO_NOTIFY_* defines to acpi/video.h

2016-12-19 Thread Rafael J. Wysocki
On Wednesday, November 09, 2016 06:15:56 PM Hans de Goede wrote:
> acpi_video.c passed the ACPI_VIDEO_NOTIFY_* defines as type code to
> acpi_notifier_call_chain(). Move these defines to acpi/video.h so
> that acpi_notifier listeners can check the type code using these
> defines.
> 
> Signed-off-by: Hans de Goede 
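
(For context, a minimal sketch of such a listener; the handler body is
illustrative only, and a real listener would typically also check
event->device_class:)

        static int example_acpi_video_notify(struct notifier_block *nb,
                                             unsigned long val, void *data)
        {
                struct acpi_bus_event *event = data;

                /* With the defines in acpi/video.h the type code can be
                 * matched directly.
                 */
                if (event->type == ACPI_VIDEO_NOTIFY_PROBE) {
                        /* handle the output (un)plug notification */
                }

                return NOTIFY_OK;
        }

        static struct notifier_block example_acpi_nb = {
                .notifier_call = example_acpi_video_notify,
        };

        /* registered elsewhere via register_acpi_notifier(&example_acpi_nb) */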

I've applied this one, but I'd like to get an ACK from the DRM folks on
the second one.

Thanks,
Rafael
