Re: [REGRESSION]: nouveau: Asynchronous wait on fence

2024-01-16 Thread Owen T. Heisler

On 10/31/23 04:18, Linux regression tracking (Thorsten Leemhuis) wrote:

Thanks for your report. With a bit of luck someone will look into this,
but I doubt it, as this report has some aspects that might cause it to be
ignored. Mainly: (a) the report was about a stable/longterm kernel and
(b) it's afaics unclear if the problem even happens with the latest
mainline kernel.



You thus might want to check if the problem occurs with 6.6 -- and
ideally also check if reverting the culprit there fixes things for you.


Thorsten,

Thank you for your reply and suggestions. I will try (a) testing on 
mainline (when I tried before I was interrupted by another, unrelated 
regression) and (b) reverting the culprit commit there if I am able to 
reproduce the problem.


Thanks,
Owen

--
Owen T. Heisler
https://owenh.net


Re: [REGRESSION]: nouveau: Asynchronous wait on fence

2024-01-16 Thread Owen T. Heisler

On 10/31/23 04:18, Linux regression tracking (Thorsten Leemhuis) wrote:

On 28.10.23 04:46, Owen T. Heisler wrote:

#regzbot introduced: d386a4b54607cf6f76e23815c2c9a3abc1d66882
#regzbot link: https://gitlab.freedesktop.org/drm/nouveau/-/issues/180

## Problem

1. Connect external display to DVI port on dock and run X with both
    displays in use.
2. Wait hours or days.
3. Suddenly the secondary Nvidia-connected display turns off and X stops
    responding to keyboard/mouse input. In *some* cases it is possible to
    switch to a virtual TTY with Ctrl+Alt+Fn and log in there.



You thus might want to check if the problem occurs with 6.6 -- and
ideally also check if reverting the culprit there fixes things for you.


Hi Thorsten and others,

The problem also occurs with v6.6. Here is a decoded kernel log from an 
untainted kernel:


https://gitlab.freedesktop.org/drm/nouveau/uploads/c120faf09da46f9c74006df9f1d14442/async-wait-on-fence-180.log

The culprit commit does not revert cleanly on v6.6. I have not yet 
attempted to resolve the conflicts.


I have also updated the bug description at
<https://gitlab.freedesktop.org/drm/nouveau/-/issues/180>.

Thanks,
Owen


Re: [REGRESSION]: acpi/nouveau: Hardware unavailable upon resume or suspend fails

2024-01-16 Thread Owen T. Heisler

Hi everyone,

On 11/10/23 06:52, Kai-Heng Feng wrote:

On Fri, Nov 10, 2023 at 2:19 PM Hans de Goede wrote:

On 11/10/23 07:09, Kai-Heng Feng wrote:

On Fri, Nov 10, 2023 at 5:55 AM Owen T. Heisler wrote:

#regzbot introduced: 89c290ea758911e660878e26270e084d862c03b0
#regzbot link: https://gitlab.freedesktop.org/drm/nouveau/-/issues/273
#regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=218124


Thanks for the bug report. Do you prefer to continue the discussion
here, on gitlab or on bugzilla?


Kai-Heng, you're welcome and thank you too. By email is fine with me.


Owen, as Kai-Heng said thank you for reporting this.


Hans, you're welcome, and thanks for your help too.


## Reproducing

1. Boot system to framebuffer console.
2. Run `systemctl suspend`. If undocked without secondary display,
suspend fails. If docked with secondary display, suspend succeeds.
3. Resume from suspend if applicable.
4. System is now in a broken state.


So I guess we need to put those devices to ACPI D3 for suspend. Let's
discuss this on your preferred platform.


Ok, so I was already sort of afraid we might see something like this
happening because of:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=89c290ea758911e660878e26270e084d862c03b0

As I mentioned during the review of that, it might be better to
not touch the video-card ACPI power-state at all and instead
only do acpi_device_fix_up_power() on the child devices.


Or the child devices need to be put to D3 during suspend.


Owen, attached are 2 patches which implement only
calling acpi_device_fix_up_power() on the child devices.
Can you build a v6.6 kernel with these 2 patches added
on top please and see if that fixes things?


Yes, with those patches v6.6 suspend works normally. That's great, thanks!

I tested with v6.6 with the 2 patches at 
<https://lore.kernel.org/regressions/a592ce0c-64f0-477d-80fa-8f5a52ba2...@redhat.com/> 
using 
<https://gitlab.freedesktop.org/drm/nouveau/uploads/788d7faf22ba2884dcc09d7be931e813/v6.6-config1>. 
I tested both docked and un-docked, just in case.


Tested-by: Owen T. Heisler 


Kai-Heng, can you test that the issue on the HP ZBook Fury 16 G10
is still resolved after applying these patches?


Yes. Thanks for the patch.

If this patch also fixes Owen's issue, then
Tested-by: Kai-Heng Feng 

Please let me know if anything else is needed from me.

Many thanks,
Owen


[REGRESSION]: acpi/nouveau: Hardware unavailable upon resume or suspend fails

2024-01-16 Thread Owen T. Heisler

#regzbot introduced: 89c290ea758911e660878e26270e084d862c03b0
#regzbot link: https://gitlab.freedesktop.org/drm/nouveau/-/issues/273
#regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=218124

## Reproducing

1. Boot system to framebuffer console.
2. Run `systemctl suspend`. If undocked without secondary display, 
suspend fails. If docked with secondary display, suspend succeeds.

3. Resume from suspend if applicable.
4. System is now in a broken state.

## Testing

- culprit commit is 89c290ea758911e660878e26270e084d862c03b0
- v6.6 fails
- v6.6 with culprit commit reverted does not fail
- Compiled with 



## Hardware

- ThinkPad W530 2438-52U
- Dock with Nvidia-connected DVI ports
- Secondary display connected via DVI
- Nvidia Optimus GPU switching system

```console
$ lspci | grep -i vga
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: NVIDIA Corporation GK107GLM [Quadro K2000M] (rev a1)
```

## Decoded logs from v6.6

- System is not docked and fails to suspend: 

- System is docked and fails after resume: 



[REGRESSION]: nouveau: Asynchronous wait on fence

2024-01-16 Thread Owen T. Heisler

#regzbot introduced: d386a4b54607cf6f76e23815c2c9a3abc1d66882
#regzbot link: https://gitlab.freedesktop.org/drm/nouveau/-/issues/180

## Problem

1. Connect external display to DVI port on dock and run X with both
   displays in use.
2. Wait hours or days.
3. Suddenly the secondary Nvidia-connected display turns off and X stops
   responding to keyboard/mouse input. In *some* cases it is possible to
   switch to a virtual TTY with Ctrl+Alt+Fn and log in there. In any
   case, shutdown/reboot after this happens is *usually* not successful
   (forced power-off is required).

This started happening after the upgrade to Debian bullseye, and the
problem remains with Debian bookworm.

## Reproducing

Unfortunately I have not found a way to reliably reproduce this bug. An
affected kernel version will crash infrequently, perhaps monthly with
"regular workstation use".

With the more recent v6.1 kernels the crash seems to occur more
frequently, perhaps within two days, if the out-of-tree acpi-call module
(Debian's [acpi-call-dkms](https://packages.debian.org/acpi-call-dkms)
package) is installed.

For the logs below that are from tainted kernels, that out-of-tree
module is why. The latest crash with the same `Asynchronous wait on
fence` error occurred on Debian stock linux-image-6.1.0-13-amd64
v6.1.55-1 with /proc/sys/kernel/tainted == 0.
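
For anyone comparing their own logs, here is a minimal sketch of how the taint value can be interpreted. The function name `taint_status` is made up for illustration. The only taint flag it decodes is the out-of-tree module flag (bit 12, value 4096, shown as `O` in traces); every other non-zero value is reported generically.

```sh
#!/bin/sh
# Hypothetical helper: classify a taint value as read from
# /proc/sys/kernel/tainted. Only the out-of-tree module flag
# (bit 12, value 4096) is decoded; other flags are reported as-is.
taint_status() {
    value=$1
    if [ "$value" -eq 0 ]; then
        echo "untainted"
    elif [ $(( value & 4096 )) -ne 0 ]; then
        echo "tainted: out-of-tree module loaded"
    else
        echo "tainted: other flags ($value)"
    fi
}

# On a running system:
#   taint_status "$(cat /proc/sys/kernel/tainted)"
```

A value of 0, as in the v6.1.55-1 crash mentioned above, means no taint flag is set at all.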

I have failed to trigger this crash by using 3D acceleration,
[Piglit](https://piglit.freedesktop.org/) tests, or other stress tests.

## Errors

Common log errors (see more log data [below](#user-content-logs)):

- `Asynchronous wait on fence nouveau:systemd-logind`
- `RIP: 0010:gf119_head_state+0xdd/0x110 [nouveau]`
- `nouveau :01:00.0: timer: stalled at ` (or `tmr:
  stalled at `)
- `Fixing recursive fault but reboot is needed!`
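
To check a saved log for the signatures above, something like the following sketch may help. The function name `scan_nouveau_log` is hypothetical, and the patterns are simply the fixed parts of the four messages listed; they are not an exhaustive description of the failure.

```sh
#!/bin/sh
# Hypothetical helper: print log lines matching any of the error
# signatures listed above. Reads the named files, or stdin if none
# are given. Exit status is grep's: 0 if any line matched, 1 if not.
scan_nouveau_log() {
    grep -E \
        -e 'Asynchronous wait on fence' \
        -e 'gf119_head_state' \
        -e '(timer|tmr): stalled at' \
        -e 'Fixing recursive fault but reboot is needed!' \
        "$@"
}

# Usage:
#   scan_nouveau_log /var/log/kern.log
#   dmesg | scan_nouveau_log
```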

## Kernel versions

At least mainline, v6.1, v5.15, v5.10, v5.4, and v4.19 series all seem
to be affected by this bug.

Here are the results of testing various specific revisions:

- v6.1.51 bad
- v5.15.130 bad
- v5.14-rc6-2-g6eaa1f3c59a7 good
- v5.10.194 bad
- v5.4.256 bad
- v4.19.294 bad
- v4.19.289 bad
- v4.19.255 bad
- v4.19.234-18-gd60f27d6ff8a bad
- v4.19.232 bad
- v4.19.205-27-gd386a4b54607 bad
- v4.19.205-26-ga78f93b9bba1 good
- v4.19.205 good
- v4.19.200 good
- v4.19.180 good

The `git bisect` result from this:

```
d386a4b54607cf6f76e23815c2c9a3abc1d66882 is the first bad commit
```
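
For reference, a bisect like this can be started roughly as sketched below. The wrapper `start_bisect` and the kernel tree path are assumptions for illustration; the revisions in the usage comment are taken from the list above. Because the crash cannot be triggered on demand, each step means running the candidate kernel long enough to trust a "good" verdict.

```sh
#!/bin/sh
# Hypothetical helper: start a bisect in the repository at $1 between
# a known-good revision ($2) and a known-bad revision ($3).
# git then checks out a midpoint commit to build and test.
start_bisect() {
    repo=$1 good=$2 bad=$3
    git -C "$repo" bisect start "$bad" "$good"
}

# Usage (the tree location ~/src/linux is an assumption):
#   start_bisect ~/src/linux v4.19.205 v4.19.232
# After building and running each candidate, record the verdict:
#   git -C ~/src/linux bisect good    # or: bisect bad
# When git names the first bad commit, clean up with:
#   git -C ~/src/linux bisect reset
```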

## Kernel configuration and compilation

Various minimal kernel configs have not exhibited the bug, so I am
testing with a Debian stock config as follows. This applies to all
tests.

- Start with the stock config from Debian
  `linux-image-4.19.0-18-amd64_4.19.208-1` at
  

- make olddefconfig
- make clean
- scripts/config --disable DEBUG_INFO \
 --set-str LOCALVERSION -$mylocalversion \
 --enable LOCALVERSION_AUTO \
 --set-str SYSTEM_TRUSTED_KEYS ''
- make -j$(nproc --all) bindeb-pkg

## Kernel boot parameters

The `nouveau.config=NvClkMode=15` parameter is in use for all
tests. Omitting this parameter does not seem to change the behavior.
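
To confirm whether a given boot actually had the parameter set, a whole-word match against the kernel command line is enough. This is only a sketch; the function name `has_boot_param` is made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: succeed if the kernel command line in $1
# contains the exact space-separated parameter $2.
has_boot_param() {
    cmdline=$1 param=$2
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a running system:
#   has_boot_param "$(cat /proc/cmdline)" nouveau.config=NvClkMode=15 \
#       && echo "NvClkMode=15 is set"
```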

## Hardware

- ThinkPad W530
- Dock with Nvidia-connected DVI ports
- Secondary display connected via DVI
- Nvidia Optimus GPU switching system

$ lspci | grep -i vga
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: NVIDIA Corporation GK107GLM [Quadro K2000M] (rev a1)

## Logs

Here is some log data I have been able to capture.

dmesg output from Debian bullseye with current nouveau-next commit
9622bcb7c72b230d64b7f7d2f9505e17214f3597. The `Asynchronous wait on
fence` happened approximately when the display turned off. The traces
happened later, triggered by running xrandr:

```
[0.00] Linux version 5.19.0-rc6-00126-g9622bcb7c72b (user@hostname) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #23 SMP PREEMPT_DYNAMIC Thu Sep 15 19:52:44 CDT 2022
[0.00] Command line: BOOT_IMAGE=/vmlinuz-5.19.0-rc6-00126-g9622bcb7c72b root=/dev/sda1 ro quiet intel_iommu=on,igfx_off l1tf=full,force mds=full,nosmt mitigations=auto,nosmt nosmt=force nouveau.config=NvClkMode=15 log_buf_len=1M
[0.025823] Kernel command line: BOOT_IMAGE=/vmlinuz-5.19.0-rc6-00126-g9622bcb7c72b root=/dev/sda1 ro quiet intel_iommu=on,igfx_off l1tf=full,force mds=full,nosmt mitigations=auto,nosmt nosmt=force nouveau.config=NvClkMode=15 log_buf_len=1M
[0.216222] pci :00:02.0: vgaarb: setting as boot VGA device
[0.216222] pci :00:02.0: vgaarb: bridge control possible
[0.216222] pci :00:02.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[0.216222] pci :01:00.0: vgaarb: bridge control possible
[0.216222] pci :01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[
```

Re: [Nouveau] [REGRESSION]: nouveau: Asynchronous wait on fence

2023-12-05 Thread Owen T. Heisler

Hi Thorsten and others,

On 12/5/23 06:33, Thorsten Leemhuis wrote:

On 29.11.23 01:37, Owen T. Heisler wrote:

On 11/21/23 14:23, Owen T. Heisler wrote:

On 11/21/23 09:16, Linux regression tracking (Thorsten Leemhuis) wrote:

On 15.11.23 07:19, Owen T. Heisler wrote:

On 10/31/23 04:18, Linux regression tracking (Thorsten Leemhuis) wrote:

On 28.10.23 04:46, Owen T. Heisler wrote:

#regzbot introduced: d386a4b54607cf6f76e23815c2c9a3abc1d66882
#regzbot link:
https://gitlab.freedesktop.org/drm/nouveau/-/issues/180

3. Suddenly the secondary Nvidia-connected display turns off and X
stops responding to keyboard/mouse input.



I am currently testing v6.6 with the culprit commit reverted.


- v6.6: fails
- v6.6 with the culprit commit reverted: works

See <https://gitlab.freedesktop.org/drm/nouveau/-/issues/180> for full
details including a decoded kernel log.


Not sure about the others, but it's kind of confusing that you update
the issue descriptions all the time and never add a comment to that ticket.


Thank you for the feedback; I will use comments more for future updates 
there. I didn't know anyone was following that issue (I haven't received 
any reply from nouveau developers on the nouveau list [1] or on gitlab 
[2]) so I have tried to keep that issue description succinct and 
up-to-date for anyone reading it for the first time.


[1]: 
<https://lists.freedesktop.org/archives/nouveau/2022-September/041001.html>

[2]: But Karol Herbst did add the "regression" label.


Anyway: Nouveau maintainers, could any of you at least comment on this?
Sure, the regression is caused by an old commit (6eaa1f3c59a707 was
merged for v5.14-rc7) and reverting it likely is not an option, but
nevertheless it would be great if this could be solved somehow.


Also if anyone has any ideas about any stress-tests or anything else 
that I might be able to trigger the crash with, please share.


Thanks,
Owen

--
Owen T. Heisler
<https://owenh.net>


Re: [Nouveau] [REGRESSION]: nouveau: Asynchronous wait on fence

2023-11-28 Thread Owen T. Heisler

On 11/21/23 14:23, Owen T. Heisler wrote:

On 11/21/23 09:16, Linux regression tracking (Thorsten Leemhuis) wrote:

On 15.11.23 07:19, Owen T. Heisler wrote:

On 10/31/23 04:18, Linux regression tracking (Thorsten Leemhuis) wrote:

On 28.10.23 04:46, Owen T. Heisler wrote:

#regzbot introduced: d386a4b54607cf6f76e23815c2c9a3abc1d66882
#regzbot link: https://gitlab.freedesktop.org/drm/nouveau/-/issues/180

3. Suddenly the secondary Nvidia-connected display turns off and X 
stops responding to keyboard/mouse input.



I am currently testing v6.6 with the culprit commit reverted.


- v6.6: fails
- v6.6 with the culprit commit reverted: works

See <https://gitlab.freedesktop.org/drm/nouveau/-/issues/180> for full 
details including a decoded kernel log.


Thanks,
Owen

--
Owen T. Heisler
<https://owenh.net>


Re: [Nouveau] [REGRESSION]: nouveau: Asynchronous wait on fence

2023-11-21 Thread Owen T. Heisler

On 11/21/23 09:16, Linux regression tracking (Thorsten Leemhuis) wrote:

On 15.11.23 07:19, Owen T. Heisler wrote:

On 10/31/23 04:18, Linux regression tracking (Thorsten Leemhuis) wrote:

On 28.10.23 04:46, Owen T. Heisler wrote:

#regzbot introduced: d386a4b54607cf6f76e23815c2c9a3abc1d66882
#regzbot link: https://gitlab.freedesktop.org/drm/nouveau/-/issues/180

## Problem

1. Connect external display to DVI port on dock and run X with both
   displays in use.
2. Wait hours or days.
3. Suddenly the secondary Nvidia-connected display turns off and X stops
   responding to keyboard/mouse input. In *some* cases it is possible to
   switch to a virtual TTY with Ctrl+Alt+Fn and log in there.



Here is a decoded kernel log from an
untainted kernel:

https://gitlab.freedesktop.org/drm/nouveau/uploads/c120faf09da46f9c74006df9f1d14442/async-wait-on-fence-180.log



Maybe one of the nouveau developers can take a quick look at
d386a4b54607cf and suggest a simple way to revert it in latest mainline.
Maybe just removing the main chunk of code that is added is all that it
takes.


I was able to resolve the revert conflict; it was indeed trivial though 
I did not realize it initially. I am currently testing v6.6 with the 
culprit commit reverted. I need to test for at least a full week (ending 
11-23) before I can assume it fixes the problem.


After that I can try the latest v6.7-rc as you suggested.

I have updated the bug description at
<https://gitlab.freedesktop.org/drm/nouveau/-/issues/180>.

Thanks again,
Owen

--
Owen T. Heisler
<https://owenh.net>


Re: [Nouveau] [REGRESSION]: acpi/nouveau: Hardware unavailable upon resume or suspend fails

2023-11-16 Thread Owen T. Heisler

On 11/12/23 14:43, Hans de Goede wrote:

Owen, Kai-Heng thank you for testing. I've submitted these patches
to Rafael (the ACPI maintainer) now (with you on the Cc).
Hopefully they will get merged soon.


That's great, thanks!

Owen



Re: [Nouveau] Patch to TroubleShooting.html

2022-09-22 Thread Owen T. Heisler

On 2022-09-20 01:15, Owen T. Heisler wrote:
I opened an issue and merge request with your suggested change. See 
<https://gitlab.freedesktop.org/nouveau/wiki/-/issues/12> and 
<https://gitlab.freedesktop.org/nouveau/wiki/-/merge_requests/25>.



Hi Klaus,

Karol Herbst merged your change and it's now live at 
<https://nouveau.freedesktop.org/TroubleShooting.html>. Thanks for 
reporting the problem and writing the updated text.


Owen


Re: [Nouveau] Patch to TroubleShooting.html

2022-09-20 Thread Owen T. Heisler

On 2022-09-08 00:08, Owen T. Heisler wrote:

You could open a new issue for the wiki (registration required) here:

https://gitlab.freedesktop.org/nouveau/wiki/-/issues/new

Or I can open an issue for you, using the body of your original message.



Hi Klaus,

I opened an issue and merge request with your suggested change. See 
<https://gitlab.freedesktop.org/nouveau/wiki/-/issues/12> and 
<https://gitlab.freedesktop.org/nouveau/wiki/-/merge_requests/25>.


You can preview the change at 
<https://owenh.pages.freedesktop.org/-/nouveau-wiki/-/jobs/28612539/artifacts/public/TroubleShooting.html#kernelmodesettingorxsetsabadornon-nativedisplaymode>.


If anyone can review this change, please do.

Thanks,
Owen


Re: [Nouveau] Ubuntu 22.04 LTS system freezes 5 minutes then unlocks on nouveau, was stable on 20.04 w/nvidia

2022-08-18 Thread Owen T. Heisler

On 2022-08-18 13:02, David G. Pickett wrote:
How can I help you find the bug?  Being both a 20 year hardware and 25 
year software computer veteran, I can follow requests pretty well.


Hi David!

Please read the Bugs page on the wiki:



There are details there (and on linked pages) about how to report a bug. 
Since your problem does not seem to be related to 3D acceleration, you 
need to register and report your bug on the freedesktop.org GitLab instance:




Please note, I am not a nouveau developer; I'm just trying to help.

Owen