This bug describes my exact symptom. I reproduce it on a different OEM (Lenovo 
ThinkPad P16 Gen 3) and a different distro (NixOS 26.05) on **both kernel 7.0.3 
and 7.1.0-rc1**, confirming this is not OEM- or distro-specific and the 7.1-rc1 
changes don't address the root cause.

---

## Hardware
- **Laptop:** Lenovo ThinkPad P16 Gen 3 (21RQ003TGE), BIOS N4FET30W (1.11)
- **iGPU:** Intel Arrow Lake-S [Intel Graphics] — PCI `8086:7d67` rev 06 
(matches reporter)
- **dGPU:** NVIDIA RTX PRO 4000 Blackwell Laptop — PCI `10de:2c39` (PRIME sync, 
NVIDIA proprietary driver 595.71.05 with the open kernel module)
- **Internal panel:** eDP-1 on iGPU
- **External monitor:** DP-5 wired to NVIDIA dGPU directly

## OS / Kernel
- **NixOS 26.05 (Yarara)**
- **Kernels tested:** Linux 7.0.3, Linux 7.1.0-rc1 — bug fires on both
- Display server: KDE Plasma 6.6.4 on **Wayland**, SDDM
- Kernel cmdline (current): `mem_sleep_default=s2idle xe.force_probe=7d67 
xe.enable_psr=0 xe.enable_dc=0 nvidia-drm.fbdev=0 fbcon=nodefer`

## Trigger (refines the reporter's "long s2idle dwell" description)
On my system there is **no system suspend** involved — no `PM: suspend entry` 
in the journal. The trigger is **autonomous PCI runtime PM suspend of the 
iGPU**. After ~1–2h of low activity, 
`/sys/bus/pci/devices/0000:00:02.0/power/runtime_suspended_time` accumulates 
suspended time even though the user is at the laptop. Each runtime-suspend → 
runtime-resume hits the buggy C10 PHY power-down/up path.

Evidence from my system before applying the workaround:
- 19h uptime, no system suspend
- iGPU `runtime_suspended_time` = **6,204,262 ms ≈ 1.72 hours** of autonomous 
suspension
- 15 occurrences of the error stack in dmesg over those 19h
- Overnight idle eventually wedged the panel completely (next-morning wake 
produced "lit-up black" — backlight on, no compositor output)

This suggests the bug is at the PCI runtime PM tier, not specific to
system s2idle. The reporter's "long s2idle dwell" finding is one
trigger; another is just leaving the machine idle long enough for PCI-PM
to suspend the device repeatedly.
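
For anyone who wants to check whether their own iGPU is being runtime-suspended, here is a minimal sketch that reads the standard runtime-PM attributes from sysfs. The helper names (`ms_to_hours`, `report_runtime_pm`) and the `DEV_DIR` variable are my own and purely illustrative; the default path assumes the iGPU sits at `0000:00:02.0`, as it does on my machine.

```shell
#!/bin/sh
# Sketch: report runtime-PM state for a PCI device from sysfs.
# Helper names and DEV_DIR are illustrative, not part of any tool.

ms_to_hours() {
    # Convert milliseconds (as reported by runtime_suspended_time) to hours.
    awk -v ms="$1" 'BEGIN { printf "%.2f", ms / 3600000 }'
}

report_runtime_pm() {
    # $1 is the sysfs directory of the device.
    dev_dir=$1
    for attr in control runtime_status runtime_suspended_time runtime_active_time; do
        f="$dev_dir/power/$attr"
        if [ -r "$f" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$f")"
        else
            printf '%s: (not readable)\n' "$attr"
        fi
    done
}

# Default to the iGPU address from this report; override DEV_DIR for other devices.
DEV_DIR=${DEV_DIR:-/sys/bus/pci/devices/0000:00:02.0}
report_runtime_pm "$DEV_DIR"
```

A `runtime_suspended_time` that keeps growing while the machine is in use is the condition described above; `ms_to_hours 6204262` reproduces the ≈ 1.72 h figure from my evidence list.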

## Error stack (identical to reporter's, except the module is `[xe]` instead of `[i915]`)
```
xe 0000:00:02.0: [drm] *ERROR* Timeout waiting for DDI BUF A to get active
xe 0000:00:02.0: [drm] *ERROR* Timed out waiting for DP idle patterns
xe 0000:00:02.0: [drm] *ERROR* [CRTC:151:pipe A] flip_done timed out
xe 0000:00:02.0: [drm] *ERROR* [CRTC:151:pipe A] mismatch in pixel_rate (expected 592800, found 44965)
xe 0000:00:02.0: [drm] *ERROR* [CRTC:151:pipe A] mismatch in dpll_hw_state
xe 0000:00:02.0: [drm] *ERROR* expected:
xe 0000:00:02.0: [drm] *ERROR* cx0pll_hw_state: lane_count: 4, ssc_enabled: no, use_c10: yes, tbt_mode: no
xe 0000:00:02.0: [drm] *ERROR* c10pll_hw_state: fracen: yes,
xe 0000:00:02.0: [drm] *ERROR* quot: 61440, rem: 0, den: 1,
xe 0000:00:02.0: [drm] *ERROR* multiplier: 210, tx_clk_div: 0.
xe 0000:00:02.0: [drm] *ERROR* found:
xe 0000:00:02.0: [drm] *ERROR* c10pll_hw_state: fracen: no,
xe 0000:00:02.0: [drm] *ERROR* multiplier: 16, tx_clk_div: 0.
xe 0000:00:02.0: [drm] *ERROR* [CRTC:151:pipe A] mismatch in port_clock (expected 810000, found 61440)
xe 0000:00:02.0: [drm] pipe state doesn't match!
WARNING: drivers/gpu/drm/i915/display/intel_modeset_verify.c:225 at intel_modeset_verify_crtc+0x324/0x520 [xe]
xe 0000:00:02.0: [drm] DPLL 0: pll hw state mismatch
WARNING: drivers/gpu/drm/i915/display/intel_dpll_mgr.c:5132 at verify_single_dpll_state+0x2db/0x590 [xe]
xe 0000:00:02.0: [drm] PHY A failed to change powerdown state
```

Same expected vs found clocks (810000 / 592800 vs 61440 / 44965), same
`multiplier 210 → 16`, same path through `intel_modeset_verify_crtc` →
`verify_single_dpll_state`.
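
To get a rough occurrence count on another machine, something like the following sketch may help. It counts kernel-log lines containing one string from the stack above (`failed to change powerdown state`). The function name `count_phy_errors` is hypothetical, and whether that line appears exactly once per occurrence is an assumption, so treat the result as a proxy rather than an exact count.

```shell
#!/bin/sh
# Sketch: count lines on stdin that carry the PHY powerdown failure message.
# count_phy_errors is an illustrative name, not an existing tool.

count_phy_errors() {
    # grep -c still prints 0 when nothing matches but exits non-zero;
    # "|| true" keeps the printed count and discards that exit status.
    grep -c 'failed to change powerdown state' || true
}

# Typical use (reads the kernel log of the current boot):
#   journalctl -k -b | count_phy_errors
#   dmesg | count_phy_errors
```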

## Workarounds tested

| Knob | Effect on my system |
|---|---|
| `i915.enable_psr=0` (kernel 7.0.3, i915 bound) | original config; bug still fires, hard wedge (display never recovers) |
| switch i915 → xe via `i915` blacklist + `xe.force_probe=7d67` | display recovers in ~40s instead of staying wedged forever, but the bug still fires |
| `xe.enable_psr=0`, `xe.enable_dc=0`, `nvidia-drm.fbdev=0`, `fbcon=nodefer` | same as above: recovery, not prevention |
| upgrade 7.0.3 → 7.1.0-rc1 (`linuxPackages_testing` in nixpkgs) | **no change**: same error signature, 15 occurrences in 19h |
| **`echo on > /sys/bus/pci/devices/0000:00:02.0/power/control`** | **prevents bug entirely**: `runtime_suspended_time` stays at 0, no PHY errors |

The PCI runtime-PM pin is the only workaround that prevents the bug from
firing on my hardware. It costs a few watts of idle power but is
reliable.

I've encoded it as a permanent udev rule in NixOS:
```udev
SUBSYSTEM=="pci", ATTR{vendor}=="0x8086", ATTR{device}=="0x7d67", ATTR{power/control}="on"
```
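
To try the pin on a running system before committing to a udev rule, a sketch along these lines should work. The function name `pin_runtime_pm` is hypothetical; it only writes `on` to the device's `power/control` attribute and reads it back. It needs root on a real sysfs path, and the setting does not survive a reboot.

```shell
#!/bin/sh
# Sketch: disable runtime suspend for one device and verify the setting.
# pin_runtime_pm is an illustrative name; $1 is the device's sysfs directory.

pin_runtime_pm() {
    ctl="$1/power/control"
    if [ ! -w "$ctl" ]; then
        echo "cannot write $ctl (missing device or not root?)" >&2
        return 1
    fi
    # "on" keeps the device active; "auto" allows runtime suspend again.
    echo on > "$ctl"
    # Succeed only if the attribute now reads back as "on".
    [ "$(cat "$ctl")" = "on" ]
}

# Typical use, as root:
#   pin_runtime_pm /sys/bus/pci/devices/0000:00:02.0
```

Writing `auto` to the same attribute re-enables runtime suspend, which makes it easy to test both states on the same boot.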

## Confirms reporter's analysis

The reporter wrote: *"This is in
`drivers/gpu/drm/i915/display/intel_cx0_phy.c` / DPLL state restore —
below the layers reachable by `enable_psr` / `enable_dc` /
`enable_fbc`."* My test of `xe.enable_dc=0` on top of `xe.enable_psr=0`
on 7.1-rc1 confirms: these knobs don't reach the racing code path. Only
preventing the device from suspending at all (via PCI runtime PM) avoids
the bug.

The reporter also noted: *"`xe.force_probe=7d67` was also tested as a
workaround. xe binds cleanly on this device but suffers a different bug:
`Tile0: GT0: Engine reset engine_class=rcs guc_id=48 state=0x289`."* I did
not see this specific xe engine reset, but I did see xe-side issues in a
different form: vblank wait spam from `drm_fb_helper_damage_work` with the
`nvidia-drmdrmfb` and `xedrmfb` fbdev clients running in parallel, fixed by
`nvidia-drm.fbdev=0 fbcon=nodefer`.

## NVIDIA dGPU exonerated on my side too

Like the reporter, I confirmed the NVIDIA proprietary driver is not
implicated. `nvidia-suspend.service` / `nvidia-resume.service` finish
cleanly. The bug is purely on the Intel display side.

## Suggested follow-up for upstream

The Mika Kahola "C10/C20 PHY PLL divider verification" patches that
landed in 7.1-rc1 are refactoring/verification work; they do not prevent
the underlying PHY-stuck-mid-powerdown race. A proper fix would need to
either:

1. Serialize PCI runtime PM resume against the C10 PHY power-state machine so 
the resume sequence doesn't race against a partially-completed previous 
powerdown.
2. On runtime-resume, detect the stuck PHY state and run an explicit re-init 
sequence before the first atomic commit.

In the meantime, please consider documenting the `power/control=on`
workaround in `Documentation/gpu/i915.rst` / `Documentation/gpu/xe/` for
Arrow Lake-S users, since none of the published module parameters
prevent the bug.

---

Happy to provide more data (drm.debug=0x1e dumps, full journal, etc.) if
useful.

https://bugs.launchpad.net/bugs/2150605

Title:
  `i915 Arrow Lake-S: PHY A / C10 DPLL state mismatch on resume from
  long s2idle dwell — slow wake (5-10s) with retry storm`
