Re: [REGRESSION]: nouveau: Asynchronous wait on fence

2023-12-05 Thread Owen T. Heisler

Hi Thorsten and others,

On 12/5/23 06:33, Thorsten Leemhuis wrote:

On 29.11.23 01:37, Owen T. Heisler wrote:

On 11/21/23 14:23, Owen T. Heisler wrote:

On 11/21/23 09:16, Linux regression tracking (Thorsten Leemhuis) wrote:

On 15.11.23 07:19, Owen T. Heisler wrote:

On 10/31/23 04:18, Linux regression tracking (Thorsten Leemhuis) wrote:

On 28.10.23 04:46, Owen T. Heisler wrote:

#regzbot introduced: d386a4b54607cf6f76e23815c2c9a3abc1d66882
#regzbot link: https://gitlab.freedesktop.org/drm/nouveau/-/issues/180

3. Suddenly the secondary Nvidia-connected display turns off and X
stops responding to keyboard/mouse input.



I am currently testing v6.6 with the culprit commit reverted.


- v6.6: fails
- v6.6 with the culprit commit reverted: works

See <https://gitlab.freedesktop.org/drm/nouveau/-/issues/180> for full
details including a decoded kernel log.


Not sure about the others, but it's kind of confusing that you update
the issue descriptions all the time and never add a comment to that ticket.


Thank you for the feedback; I will use comments more for future updates 
there. I didn't know anyone was following that issue (I haven't received 
any reply from nouveau developers on the nouveau list [1] or on gitlab 
[2]) so I have tried to keep that issue description succinct and 
up-to-date for anyone reading it for the first time.


[1]: 
<https://lists.freedesktop.org/archives/nouveau/2022-September/041001.html>

[2]: But Karol Herbst did add the "regression" label.


Anyway: Nouveau maintainers, could any of you at least comment on this?
Sure, the regression is caused by an old commit (6eaa1f3c59a707 was
merged for v5.14-rc7) and reverting it likely is not an option, but it
nevertheless would be great if this could be solved somehow.


Also if anyone has any ideas about any stress-tests or anything else 
that I might be able to trigger the crash with, please share.


Thanks,
Owen

--
Owen T. Heisler
<https://owenh.net>


Re: [REGRESSION]: nouveau: Asynchronous wait on fence

2023-11-28 Thread Owen T. Heisler

On 11/21/23 14:23, Owen T. Heisler wrote:

On 11/21/23 09:16, Linux regression tracking (Thorsten Leemhuis) wrote:

On 15.11.23 07:19, Owen T. Heisler wrote:

On 10/31/23 04:18, Linux regression tracking (Thorsten Leemhuis) wrote:

On 28.10.23 04:46, Owen T. Heisler wrote:

#regzbot introduced: d386a4b54607cf6f76e23815c2c9a3abc1d66882
#regzbot link: https://gitlab.freedesktop.org/drm/nouveau/-/issues/180

3. Suddenly the secondary Nvidia-connected display turns off and X 
stops responding to keyboard/mouse input.



I am currently testing v6.6 with the culprit commit reverted.


- v6.6: fails
- v6.6 with the culprit commit reverted: works

See <https://gitlab.freedesktop.org/drm/nouveau/-/issues/180> for full 
details including a decoded kernel log.


Thanks,
Owen

--
Owen T. Heisler
<https://owenh.net>


Re: [REGRESSION]: nouveau: Asynchronous wait on fence

2023-11-21 Thread Owen T. Heisler

On 11/21/23 09:16, Linux regression tracking (Thorsten Leemhuis) wrote:

On 15.11.23 07:19, Owen T. Heisler wrote:

On 10/31/23 04:18, Linux regression tracking (Thorsten Leemhuis) wrote:

On 28.10.23 04:46, Owen T. Heisler wrote:

#regzbot introduced: d386a4b54607cf6f76e23815c2c9a3abc1d66882
#regzbot link: https://gitlab.freedesktop.org/drm/nouveau/-/issues/180

## Problem

1. Connect external display to DVI port on dock and run X with both
     displays in use.
2. Wait hours or days.
3. Suddenly the secondary Nvidia-connected display turns off and X stops
     responding to keyboard/mouse input. In *some* cases it is possible to
     switch to a virtual TTY with Ctrl+Alt+Fn and log in there.



Here is a decoded kernel log from an
untainted kernel:

https://gitlab.freedesktop.org/drm/nouveau/uploads/c120faf09da46f9c74006df9f1d14442/async-wait-on-fence-180.log



Maybe one of the nouveau developers can take a quick look at
d386a4b54607cf and suggest a simple way to revert it in latest mainline.
Maybe just removing the main chunk of code that is added is all that it
takes.


I was able to resolve the revert conflict; it was indeed trivial though 
I did not realize it initially. I am currently testing v6.6 with the 
culprit commit reverted. I need to test for at least a full week (ending 
11-23) before I can assume it fixes the problem.


After that I can try the latest v6.7-rc as you suggested.

I have updated the bug description at
<https://gitlab.freedesktop.org/drm/nouveau/-/issues/180>.

Thanks again,
Owen

--
Owen T. Heisler
<https://owenh.net>


Re: [REGRESSION]: acpi/nouveau: Hardware unavailable upon resume or suspend fails

2023-11-16 Thread Owen T. Heisler

On 11/12/23 14:43, Hans de Goede wrote:

Owen, Kai-Heng thank you for testing. I've submitted these patches
to Rafael (the ACPI maintainer) now (with you on the Cc).
Hopefully they will get merged soon.


That's great, thanks!

Owen



Re: [REGRESSION]: nouveau: Asynchronous wait on fence

2023-11-14 Thread Owen T. Heisler

On 10/31/23 04:18, Linux regression tracking (Thorsten Leemhuis) wrote:

On 28.10.23 04:46, Owen T. Heisler wrote:

#regzbot introduced: d386a4b54607cf6f76e23815c2c9a3abc1d66882
#regzbot link: https://gitlab.freedesktop.org/drm/nouveau/-/issues/180

## Problem

1. Connect external display to DVI port on dock and run X with both
    displays in use.
2. Wait hours or days.
3. Suddenly the secondary Nvidia-connected display turns off and X stops
    responding to keyboard/mouse input. In *some* cases it is possible to
    switch to a virtual TTY with Ctrl+Alt+Fn and log in there.



You thus might want to check if the problem occurs with 6.6 -- and
ideally also check if reverting the culprit there fixes things for you.


Hi Thorsten and others,

The problem also occurs with v6.6. Here is a decoded kernel log from an 
untainted kernel:


https://gitlab.freedesktop.org/drm/nouveau/uploads/c120faf09da46f9c74006df9f1d14442/async-wait-on-fence-180.log

The culprit commit does not revert cleanly on v6.6. I have not yet 
attempted to resolve the conflicts.


I have also updated the bug description at
<https://gitlab.freedesktop.org/drm/nouveau/-/issues/180>.

Thanks,
Owen


Re: [REGRESSION]: acpi/nouveau: Hardware unavailable upon resume or suspend fails

2023-11-11 Thread Owen T. Heisler

Hi everyone,

On 11/10/23 06:52, Kai-Heng Feng wrote:

On Fri, Nov 10, 2023 at 2:19 PM Hans de Goede wrote:

On 11/10/23 07:09, Kai-Heng Feng wrote:

On Fri, Nov 10, 2023 at 5:55 AM Owen T. Heisler wrote:

#regzbot introduced: 89c290ea758911e660878e26270e084d862c03b0
#regzbot link: https://gitlab.freedesktop.org/drm/nouveau/-/issues/273
#regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=218124


Thanks for the bug report. Do you prefer to continue the discussion
here, on gitlab or on bugzilla?


Kai-Heng, you're welcome and thank you too. By email is fine with me.


Owen, as Kai-Heng said thank you for reporting this.


Hans, you're welcome, and thanks for your help too.


## Reproducing

1. Boot system to framebuffer console.
2. Run `systemctl suspend`. If undocked without secondary display,
suspend fails. If docked with secondary display, suspend succeeds.
3. Resume from suspend if applicable.
4. System is now in a broken state.


So I guess we need to put those devices to ACPI D3 for suspend. Let's
discuss this on your preferred platform.


Ok, so I was already sort of afraid we might see something like this
happening because of:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=89c290ea758911e660878e26270e084d862c03b0

As I mentioned during the review of that, it might be better to
not touch the video-card ACPI power-state at all and instead
only do acpi_device_fix_up_power() on the child devices.


Or the child devices need to be put to D3 during suspend.


Owen, attached are 2 patches which implement only
calling acpi_device_fix_up_power() on the child devices,
can you build a v6.6 kernel with these 2 patches added
on top please and see if that fixes things ?


Yes, with those patches v6.6 suspend works normally. That's great, thanks!

I tested with v6.6 with the 2 patches at 
<https://lore.kernel.org/regressions/a592ce0c-64f0-477d-80fa-8f5a52ba2...@redhat.com/> 
using 
<https://gitlab.freedesktop.org/drm/nouveau/uploads/788d7faf22ba2884dcc09d7be931e813/v6.6-config1>. 
I tested both docked and un-docked, just in case.


Tested-by: Owen T. Heisler 


Kai-Heng can you test that the issue on the HP ZBook Fury 16 G10
is still resolved after applying these patches ?


Yes. Thanks for the patch.

If this patch also fixes Owen's issue, then
Tested-by: Kai-Heng Feng 

Please let me know if anything else is needed from me.

Many thanks,
Owen


[REGRESSION]: acpi/nouveau: Hardware unavailable upon resume or suspend fails

2023-11-10 Thread Owen T. Heisler

#regzbot introduced: 89c290ea758911e660878e26270e084d862c03b0
#regzbot link: https://gitlab.freedesktop.org/drm/nouveau/-/issues/273
#regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=218124

## Reproducing

1. Boot system to framebuffer console.
2. Run `systemctl suspend`. If undocked without secondary display,
suspend fails. If docked with secondary display, suspend succeeds.
3. Resume from suspend if applicable.
4. System is now in a broken state.

## Testing

- culprit commit is 89c290ea758911e660878e26270e084d862c03b0
- v6.6 fails
- v6.6 with culprit commit reverted does not fail
- Compiled with 
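
A minimal sketch of how the "culprit commit reverted" test tree could be prepared, assuming a local git clone of the mainline kernel. The helper name `revert_on_tag` and the `revert-test-` branch prefix are hypothetical and are not taken from this report.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): create a test branch at a
# given tag and revert a single commit on top of it.
# Usage: revert_on_tag <repo-dir> <tag> <commit>
revert_on_tag() {
    repo=$1
    tag=$2
    commit=$3

    # Start a fresh branch at the tagged release.
    git -C "$repo" checkout -q -b "revert-test-$tag" "$tag" || return 1

    # --no-edit keeps the default revert commit message. If the commit
    # does not revert cleanly, git stops here with a non-zero status and
    # the conflicts have to be resolved by hand.
    git -C "$repo" revert --no-edit "$commit"
}
```

For the test described above, this might be called as `revert_on_tag <path-to-linux-clone> v6.6 89c290ea758911e660878e26270e084d862c03b0`, followed by a normal kernel build. A non-zero exit status means the revert did not apply cleanly and needs manual conflict resolution.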



## Hardware

- ThinkPad W530 2438-52U
- Dock with Nvidia-connected DVI ports
- Secondary display connected via DVI
- Nvidia Optimus GPU switching system

```console
$ lspci | grep -i vga
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: NVIDIA Corporation GK107GLM [Quadro K2000M] (rev a1)
```

## Decoded logs from v6.6

- System is not docked and fails to suspend: 

- System is docked and fails after resume: