Additional data from Raspberry Pi 5 cluster running Ubuntu 25.10 + kernel 6.17.0-1004-raspi:

ENVIRONMENT
-----------
- Hardware: 2x Raspberry Pi 5 Model B Rev 1.0 (revision 0x00d04170)
- OS: Ubuntu 25.10 (Questing Quokka)
- Kernel: 6.17.0-1004-raspi
- Ethernet: Cadence GEM (macb) via RP1 PCIe southbridge
- PHY: Broadcom BCM54213PE
- CPU Governor: performance (workaround already applied)

KEY FINDING: NETWORK DIES SILENTLY
----------------------------------
We captured precise logs from the moment of network death:

  Dec 08 10:20:03.988843 kernel: audit: apparmor="AUDIT" (last normal log)
  Dec 08 10:20:12.548244 k3s: context deadline exceeded (NETWORK ALREADY DEAD)
  Dec 08 10:20:42.668863 k3s: http2: client connection lost
  Dec 08 10:20:51.188763 k3s: read tcp 172.16.101.2->172.16.101.1:6443: i/o timeout

Critical observation: Between 10:20:03 and 10:20:12 (9 seconds), the network died with ZERO kernel messages:
- No "Link is Down" from macb driver
- No PCIe errors
- No IRQ errors
- No DMA errors
- No RP1 messages
- The last macb kernel message was at boot time (Dec 02)

RCU STALLS ARE CONSEQUENCE, NOT CAUSE
-------------------------------------
Timeline analysis:
- 05:09:56 - 1 RCU stall (5 hours BEFORE network death)
- 10:20:12 - Network died (ZERO RCU stalls at this moment)
- 13:56:56 - 1 RCU stall (3.5 hours AFTER network death)

The RCU stall at 13:56 occurred because processes were stuck waiting for network I/O that would never complete. The network death itself triggered no RCU stalls.

CONFIRMED ON MULTIPLE FIRMWARE VERSIONS
---------------------------------------
- Node 1: EEPROM 2025-05-08, RP1 fw eb39cfd516f8c90628aa9d91f52370aade5d0a55
  Network Death: Yes
- Node 2: EEPROM 2024-09-23 (older fw)
  Network Death: Yes
- Node 2: EEPROM 2025-02-12, RP1 fw eb39cfd516f8c90628aa9d91f52370aade5d0a55
  Network Death: Monitoring

Both nodes experience identical symptoms regardless of EEPROM/firmware version.
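
Since the failure described under KEY FINDING leaves nothing in the kernel log, one way to timestamp it more tightly would be a small userspace poller. The following is only a sketch, not something we have run on the affected nodes. It assumes the interface is named eth0, and it assumes (unverified) that rx_packets stops advancing when the hang occurs while the carrier flag stays at 1. The names Sample, read_sample, looks_stalled and watch are made up for illustration; the sysfs files it reads are the standard ones under /sys/class/net/<iface>/.

```python
from dataclasses import dataclass
from pathlib import Path
import time

SYSFS_NET = Path("/sys/class/net")


@dataclass(frozen=True)
class Sample:
    carrier: bool      # True if the kernel reports the link as up
    rx_packets: int    # cumulative RX packet counter
    tx_packets: int    # cumulative TX packet counter (logged only)


def read_sample(iface: str, root: Path = SYSFS_NET) -> Sample:
    """Read carrier state and packet counters for one interface from sysfs."""
    base = root / iface
    try:
        carrier = (base / "carrier").read_text().strip() == "1"
    except OSError:
        # Reading 'carrier' fails while the interface is administratively down.
        carrier = False
    rx = int((base / "statistics" / "rx_packets").read_text())
    tx = int((base / "statistics" / "tx_packets").read_text())
    return Sample(carrier, rx, tx)


def looks_stalled(prev: Sample, curr: Sample) -> bool:
    """Link reported up at both samples, yet no packet was received between them."""
    return prev.carrier and curr.carrier and curr.rx_packets == prev.rx_packets


def watch(iface: str = "eth0", interval: float = 5.0, strikes: int = 3) -> None:
    """Print one line when `strikes` consecutive intervals show no RX progress."""
    prev = read_sample(iface)
    stalled = 0
    while True:
        time.sleep(interval)
        curr = read_sample(iface)
        stalled = stalled + 1 if looks_stalled(prev, curr) else 0
        if stalled == strikes:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} {iface}: carrier up, rx_packets stuck at "
                  f"{curr.rx_packets}, tx_packets={curr.tx_packets}", flush=True)
        prev = curr


if __name__ == "__main__":
    watch()
```

On a genuinely idle link this would produce false positives. On a k3s node with constant API traffic, several consecutive intervals without a single received packet would be a reasonable hint that the interface has hung. Output goes to stdout, so running it under a systemd unit would put the timestamp in the journal next to the k3s errors.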

WHAT WE RULED OUT
-----------------
- Thermal throttling (no warnings in logs)
- Undervoltage (no warnings in logs)
- Link down events (PHY stayed "up")
- CPU governor (performance mode already set)
- rp1-pio failure: Node 2 had "failed to contact RP1 firmware" error on every boot due to old EEPROM (2024-09-23). After EEPROM update to 2025-02-12, rp1-pio works. However, Node 1 always had rp1-pio working, yet BOTH nodes experienced identical network death. This confirms rp1-pio status is unrelated to network death.

HYPOTHESIS
----------
The macb/GEM driver or RP1 ethernet controller enters a broken state without logging any error. The driver doesn't detect the failure, so no "Link is Down" message is generated. The PHY link layer appears up, but no packets are transmitted/received. This suggests a driver bug where error detection/logging is missing for this failure mode.

-- 
https://bugs.launchpad.net/bugs/2133877

Title:
  Complete network hang on Raspberry Pi 5 with kernel 6.17 under load - possibly related to CPU frequency scaling
