## Follow-up: v2 patches sent to netdev, follow-up PR to raspberrypi/linux

Update on the patch series originally posted as `[RFC PATCH net-next 0/3]` (cilium/cilium#43198 and this bug). Three things have happened since:

### 1. v1 RFC drew constructive review on netdev

- Andrea della Porta (SUSE, 2026-05-05) replicated the silent stall on `v6.19.10 + the RFC series` with patch 3's `netdev_warn_once` swapped to `netdev_warn`, and confirmed that the kernel-level event pattern matches the userspace-visible symptom on his side. Prompted by his findings, I reran the dmesg sweep across our fleet with the same change, which confirmed: **patches 1 and 2 of the RFC are partial at best, patch 3's watchdog is empirically the load-bearing fix on this platform.**
- Théo Lebrun (Bootlin, 2026-05-14), the new macb maintainer succeeding Nicolas Ferre / Claudiu Beznea, asked four clarifying questions on the cover letter (Tx-only vs Tx+Rx scope, recovery method diversity, the TSO+SG-off workaround, the `cdns,*-max-pipe` DT-props lead). Answers are in the threaded reply on lore.

### 2. Self-audit caught a regression in v1 patch 2

While preparing v2 I caught a real bug introduced by v1 patch 2: it read `ISR` in `macb_tx_poll()` and masked the value down to TCOMP, but on read-clear ISR silicon (which `raspberrypi_rp1_config` defaults to) the read consumes **every** bit set in the register. Any RCOMP / ROVR / TXUBR / etc. bit set at the moment of the read was silently dropped on the floor; on a level-triggered IRQ the consumed bit dropped the line before GIC delivery, so the IRQ for that event was never raised at all. The window is small (~200-500 ns per `macb_tx_poll` completion) but real, and at moderate-to-heavy RX load the math works out to non-negligible RX-completion loss. Disclosed to netdev in the reply.

### 3. v2 sent to netdev + follow-up PR to raspberrypi/linux

**Mainline v2 (3 patches + cover):**

- lore: https://lore.kernel.org/netdev/[email protected]/T/
- Patch 1: gates the PCIe-posted-write flush behind a new `MACB_CAPS_PCIE_POSTED_WRITES` cap (RP1-only) so SoC-integrated macb variants don't pay the readback cost
- Patch 2: replaces the destructive ISR read with `(void)queue_readl(queue, IMR)`, a non-destructive PCIe barrier with the same flush effect
- Patch 3: adds a `netif_carrier_ok()` gate (eliminates a boot-time false positive at autoneg-pending) and swaps `netdev_warn_once` → `netdev_warn_ratelimited`

**raspberrypi/linux follow-up PR:** https://github.com/raspberrypi/linux/pull/7369 (three commits porting the same v2 fixes against `rpi-6.18.y`, which already merged the original series at #7340 on 2026-05-08).

For `linux-raspi` package users specifically: once mainline v2 is accepted into `net-next` (or the raspberrypi/linux follow-up PR is merged), the fix should flow through Canonical's kernel SRU process. I don't have visibility into when that lands.

### TSO + SG workaround note

Several commenters here and on cilium/cilium#43198 (rtheobald in particular) confirmed that disabling **both** TSO and SG (scatter-gather) via `ethtool -K eth0 tso off sg off` prevents the stall on `linux-raspi 6.17.0-1004`. That is consistent with an interaction in the descriptor-fragment path. I tested with TSO+GSO off (SG left on) fleet-wide and the stall still fired; SG-off specifically is listed on my v2 cover as an outstanding canary if reviewers want it. Either way, the kernel-side fixes in v2 close the underlying race rather than masking it.

### Runtime so far on the v2 fixes

24-node Raspberry Pi 5 fleet running a vendor-fork-anchored build of v2 patch 2 (the destructive-ISR fix) since 2026-05-14 ~14:00 UTC; ~120 cumulative node-hours so far, **zero mid-runtime TX stalls** and zero userspace-watchdog RECOVER events. Pre-patch baseline (~0.5 stall/node-hour at fleet level) would have predicted ~60 stalls in that window.
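
The baseline comparison above is simple arithmetic, sketched here so the numbers can be re-run as uptime accumulates. The function names are hypothetical, and the zero-stall probability assumes stalls arrive as a Poisson process at the pre-patch rate; that is a modelling assumption of this sketch, not something measured on the fleet.

```c
#include <assert.h>

/* Expected number of stalls if the pre-patch rate still applied. */
double expected_stalls(double stalls_per_node_hour, double node_hours)
{
    assert(stalls_per_node_hour >= 0.0 && node_hours >= 0.0);
    return stalls_per_node_hour * node_hours;
}

/*
 * log10 of the probability of seeing zero stalls, ASSUMING a Poisson
 * process with mean `lambda` (assumption of this sketch):
 *   P(0) = exp(-lambda)  =>  log10 P(0) = -lambda / ln(10)
 * Working in log space avoids needing libm for exp().
 */
double log10_prob_zero_stalls(double lambda)
{
    const double ln10 = 2.302585092994046;

    assert(lambda >= 0.0);
    return -lambda / ln10;
}
```

With the figures quoted above, `expected_stalls(0.5, 120.0)` gives 60, and `log10_prob_zero_stalls(60.0)` is about -26, so under that Poisson assumption a clean window of this length would be very unlikely if the baseline rate still held. The estimate is only as good as the ~0.5 stall/node-hour baseline and the independence assumption behind it.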
Will report back as more uptime accumulates.

-- 
https://bugs.launchpad.net/bugs/2133877

Title:
  Complete network hang on Raspberry Pi 5 with kernel 6.17 under load - possibly related to CPU frequency scaling
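
As a supplement to section 2, the read-clear hazard can be shown with a small userspace model. This is a sketch, not driver code: `struct sim_queue`, the `sim_*` helpers and the bit positions are all hypothetical stand-ins, and the real driver goes through `queue_readl()` on MMIO registers. It only shows why masking the result of a read-to-clear register does not protect the bits that were masked away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions only; not taken from the macb headers. */
#define SIM_ISR_RCOMP (1u << 1)
#define SIM_ISR_TCOMP (1u << 7)

/* Hypothetical model of one queue's interrupt registers. */
struct sim_queue {
    uint32_t isr; /* status register, modelled as read-to-clear */
    uint32_t imr; /* mask register, reads have no side effects */
};

/* Read-to-clear: returns every pending bit and clears all of them. */
uint32_t sim_read_isr(struct sim_queue *q)
{
    uint32_t pending;

    assert(q != NULL);
    pending = q->isr;
    q->isr = 0;
    return pending;
}

/* Plain read: usable as an ordering barrier without touching status. */
uint32_t sim_read_imr(const struct sim_queue *q)
{
    assert(q != NULL);
    return q->imr;
}

/* v1-style flush: the mask filters the return value, not the clear. */
uint32_t sim_flush_v1(struct sim_queue *q)
{
    return sim_read_isr(q) & SIM_ISR_TCOMP;
}

/* v2-style flush: read IMR and discard it; pending status survives. */
void sim_flush_v2(struct sim_queue *q)
{
    (void)sim_read_imr(q);
}
```

If RCOMP and TCOMP are both pending, `sim_flush_v1()` returns only TCOMP and leaves `isr` at zero, so the RX completion is gone; `sim_flush_v2()` leaves both bits pending for the interrupt handler.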
