## Follow-up: v2 patches sent to netdev, follow-up PR to raspberrypi/linux

Update on the patch series originally posted as `[RFC PATCH net-next
0/3]` (cilium/cilium#43198 and this bug). Three things have happened
since:

### 1. v1 RFC drew constructive review on netdev

- Andrea della Porta (SUSE, 2026-05-05) replicated the silent stall on
`v6.19.10 + the RFC series` with patch 3's `netdev_warn_once` swapped to
`netdev_warn`, and confirmed that the kernel-level event pattern matches
the userspace-visible symptom on his side. His findings prompted me to
rerun the dmesg sweep across our fleet with the same change, which
confirmed: **patches 1 and 2 of the RFC are partial at best; patch 3's
watchdog is empirically the load-bearing fix on this platform.**

- Théo Lebrun (Bootlin, 2026-05-14), the new macb maintainer succeeding
Nicolas Ferre / Claudiu Beznea, asked four clarifying questions on the
cover letter (Tx-only vs Tx+Rx scope, recovery method diversity, the
TSO+SG-off workaround, and the `cdns,*-max-pipe` DT-props lead). My
answers are in the threaded reply on lore.

### 2. Self-audit caught a regression in v1 patch 2

While preparing v2 I caught a real bug introduced by v1 patch 2: it read
`ISR` in `macb_tx_poll()` and masked the value down to TCOMP, but on
read-clear ISR silicon — which `raspberrypi_rp1_config` defaults to —
the read consumes **every** bit set in the register. Any RCOMP / ROVR /
TXUBR / etc. set at the moment of the read was silently dropped on the
floor; on a level-triggered IRQ the consumed bit dropped the line before
GIC delivery, so the IRQ for that event was never raised at all.
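
To make the failure mode concrete, here is a toy model of a read-to-clear status register. It is a sketch, not driver code: the class, the helper names, and the bit positions are illustrative assumptions, and interrupt masking is ignored. It only shows why masking the value after the read does not protect the other bits, while reading a side-effect-free register (standing in for `IMR`) leaves them pending.

```python
"""Toy model of a read-to-clear interrupt status register (not driver code)."""

# Illustrative bit positions only; consult the driver headers for the real layout.
RCOMP = 1 << 1   # receive complete
TXUBR = 1 << 3   # TX used-bit read
TCOMP = 1 << 7   # transmit complete


class ReadClearISR:
    """Status register whose every read returns and then clears all bits."""

    def __init__(self):
        self.bits = 0

    def raise_event(self, mask):
        self.bits |= mask

    def read(self):
        value, self.bits = self.bits, 0   # the read itself is destructive
        return value

    def irq_line_asserted(self):
        # Level-triggered: the line stays high only while a bit is pending.
        return self.bits != 0


def destructive_flush(isr):
    """v1 patch 2 pattern: read ISR, keep only TCOMP, discard the rest."""
    return isr.read() & TCOMP


def nondestructive_flush(isr, imr_value=0):
    """v2 pattern: read a register with no side effects (stands in for IMR)."""
    return imr_value   # ISR is never touched


def demo():
    lost = ReadClearISR()
    lost.raise_event(TCOMP | RCOMP)
    seen = destructive_flush(lost)        # RCOMP is consumed along with TCOMP

    kept = ReadClearISR()
    kept.raise_event(TCOMP | RCOMP)
    nondestructive_flush(kept)            # both bits remain pending

    return seen, lost.irq_line_asserted(), kept.bits, kept.irq_line_asserted()
```

In the first case the caller only ever sees TCOMP and the modelled IRQ line drops, so nothing is left to signal the RX completion; in the second case both bits survive for the interrupt handler.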

The window is small (~200-500 ns per `macb_tx_poll` completion) but
real, and at moderate-to-heavy RX load it adds up to a non-negligible
rate of lost RX completions. I disclosed this to netdev in the reply.
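
A rough way to see how a sub-microsecond window becomes a measurable loss rate is sketched below. The only figure taken from this report is the 200-500 ns window; the poll and RX event rates are hypothetical placeholders, not measurements, and the model assumes RX events arrive independently of TX polls.

```python
def lost_rx_events_per_second(tx_polls_per_s, window_s, rx_events_per_s):
    """Expected RX events per second that land inside a destructive-read window.

    Assumes RX events arrive independently of TX polls, so the chance that
    an event hits a window equals the fraction of time a window is open.
    """
    open_fraction = min(1.0, tx_polls_per_s * window_s)
    return rx_events_per_s * open_fraction


# Hypothetical rates, chosen only to illustrate the arithmetic.
for window_ns in (200, 500):
    lost = lost_rx_events_per_second(
        tx_polls_per_s=10_000,
        window_s=window_ns * 1e-9,
        rx_events_per_s=20_000,
    )
    print(f"{window_ns} ns window: ~{lost:.0f} lost RX events/s")
```

The loss scales with the product of the two rates, which is why it only becomes visible when TX completions and RX traffic are both busy at the same time.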

### 3. v2 sent to netdev + follow-up PR to raspberrypi/linux

**Mainline v2 (3 patches + cover):**
- lore: https://lore.kernel.org/netdev/[email protected]/T/
- Patch 1: gates the PCIe-posted-write flush behind a new
  `MACB_CAPS_PCIE_POSTED_WRITES` cap (RP1-only) so SoC-integrated macb
  variants don't pay the readback cost
- Patch 2: replaces the destructive ISR read with
  `(void)queue_readl(queue, IMR)`, a non-destructive PCIe barrier with
  the same flush effect
- Patch 3: adds a `netif_carrier_ok()` gate (eliminates a boot-time false
  positive at autoneg-pending) and swaps `netdev_warn_once` →
  `netdev_warn_ratelimited`

**raspberrypi/linux follow-up PR:**
https://github.com/raspberrypi/linux/pull/7369 — three commits porting
the same v2 fixes against `rpi-6.18.y` (which already merged the
original series at #7340 on 2026-05-08).

For `linux-raspi` package users specifically: once mainline v2 is
accepted into `net-next` (or the raspberrypi/linux follow-up PR is
merged), the fix should flow through Canonical's kernel SRU process. I
don't have visibility into when that lands.

### TSO + SG workaround note

Several commenters here and on cilium/cilium#43198 (rtheobald in
particular) confirmed that disabling **both** TSO and SG (scatter-gather)
via `ethtool -K eth0 tso off sg off` prevents the stall on `linux-raspi
6.17.0-1004`. That is consistent with an interaction in the
descriptor-fragment path. I tested with TSO and GSO off (SG left on)
fleet-wide and the stall still fired; an SG-off-only run is listed on my
v2 cover letter as an outstanding canary test if reviewers want it.
Either way, the kernel-side fixes in v2 close the underlying race rather
than masking it.
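
For anyone applying the workaround, the sketch below checks whether it actually took effect by parsing `ethtool -k` output (lowercase `-k` lists the current offload state). The function names are my own, the `eth0` default is an assumption about the interface name, and the sample text is hand-written in the shape of that output rather than captured from a real node.

```python
import subprocess


def parse_features(ethtool_k_output):
    """Map feature name -> True/False from `ethtool -k` style output."""
    features = {}
    for line in ethtool_k_output.splitlines():
        name, sep, state = line.strip().partition(":")
        words = state.split()
        # Skip the "Features for <iface>:" header and any malformed lines.
        if sep and words and words[0] in ("on", "off"):
            features[name] = words[0] == "on"
    return features


def workaround_active(features):
    """True only when both TSO and scatter-gather are reported off."""
    return (features.get("tcp-segmentation-offload") is False
            and features.get("scatter-gather") is False)


def read_features(iface="eth0"):
    """Query the live interface; requires ethtool to be installed."""
    result = subprocess.run(["ethtool", "-k", iface],
                            capture_output=True, text=True, check=True)
    return parse_features(result.stdout)


# Hand-written sample in the shape of `ethtool -k` output, not a real capture.
SAMPLE = """Features for eth0:
scatter-gather: off
\ttx-scatter-gather: off
tcp-segmentation-offload: off
generic-segmentation-offload: on
"""
```

On a live node, `workaround_active(read_features("eth0"))` should return `True` once the `ethtool -K` command above has been applied.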

### Runtime so far on the v2 fixes

A 24-node Raspberry Pi 5 fleet has been running a vendor-fork-anchored
build of v2 patch 2 (the destructive-ISR fix) since 2026-05-14 ~14:00
UTC: ~120 cumulative node-hours so far, with **zero mid-runtime TX
stalls** and zero userspace-watchdog RECOVER events. The pre-patch
baseline (~0.5 stalls/node-hour at fleet level) would have predicted ~60
stalls in that window.
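
As a sanity check on how much weight zero stalls in that window can carry, the sketch below redoes the arithmetic. It assumes stalls arrive as a Poisson process at the baseline rate, which is my simplification and not something measured here.

```python
import math

node_hours = 120.0           # cumulative runtime reported above
baseline_rate = 0.5          # pre-patch stalls per node-hour (fleet level)

expected_stalls = baseline_rate * node_hours
# Poisson model: P(0 events) = exp(-expected count).
p_zero_if_unfixed = math.exp(-expected_stalls)

print(f"expected stalls at baseline: {expected_stalls:.0f}")
print(f"P(zero stalls | baseline rate): {p_zero_if_unfixed:.1e}")
```

Under that assumption, seeing zero stalls by chance with the bug still present would have a probability on the order of 1e-26, so the short runtime is already strong evidence, though it says nothing about rarer failure modes.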

Will report back as more uptime accumulates.

-- 
https://bugs.launchpad.net/bugs/2133877

Title:
  Complete network hang on Raspberry Pi 5 with kernel 6.17 under load -
  possibly related to CPU frequency scaling


