Public bug reported:
SUMMARY
=======
Raspberry Pi 5 experiences complete network connectivity loss (not just
packet loss) under moderate load. The PHY link remains UP, but no
traffic passes. The system continues running locally but is completely
unreachable over the network. Only a power cycle recovers the system.
Potential workaround being tested: Setting CPU governor to "performance"
(fixed frequency) eliminates frequent RCU stalls that preceded the hang.
We hypothesize the issue may be related to CPU frequency transitions
interacting poorly with the RP1/macb ethernet driver, but this is not
yet confirmed.
HARDWARE
========
- Devices: 2x Raspberry Pi 5 Model B Rev 1.0 (Revision d04170) - identical units
- RP1 chip_id: 0x20001927
- RP1 Firmware: eb39cfd516f8c90628aa9d91f52370aade5d0a55
- Ethernet: Cadence GEM rev 0x00070109 via RP1 southbridge
- PHY: Broadcom BCM54213PE (rgmii-id mode)
- Storage: NVMe SSDs via Geekworm X1002 NVMe HAT:
  - Node 1 (CP): Samsung SSD 990 EVO 1TB
  - Node 2 (Worker): WD Green SN350 500GB
- Note: Fast NVMe storage on both nodes rules out slow I/O as a contributing
  factor
SOFTWARE
========
- OS: Ubuntu 25.10 (Questing Quokka) aarch64
- Kernel: 6.17.0-1004-raspi #4-Ubuntu SMP PREEMPT_DYNAMIC (package:
linux-image-6.17.0-1004-raspi 6.17.0-1004.4)
- Kernel source: Ubuntu linux-raspi (mainline + Ubuntu patches, NOT
raspberrypi/linux fork)
- Previous kernel: 6.14.0-1017-raspi (also installed, available for testing)
- CPU Governor (before workaround): ondemand
- Available frequencies: 1.5GHz - 2.4GHz (100MHz steps)
- Network stack: Cilium CNI 1.16.x with eBPF/XDP (kube-proxy replacement)
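For reference, the governor and frequency range listed above can be read back
from sysfs with standard cpufreq paths (scaling_available_frequencies may not
be exposed by every cpufreq driver, hence the redirect):

  uname -r
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq   # kHz
  cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq   # kHz
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies 2>/dev/null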
IMPORTANT NOTE ON TIMING
========================
The problems started appearing around the upgrade to Ubuntu 25.10 (which
brought kernel 6.17). This upgrade coincided with migration to Cilium
CNI with eBPF datapath. The hangs have been observed for approximately
one month now.
Systems are fully updated (phased updates ignored), but this has not
changed the situation.
It's possible that:
1. The issue is purely kernel 6.17 regression
2. The issue is Cilium eBPF interaction with macb driver
3. The issue is a combination of both under load
We cannot definitively attribute the issue to the kernel, as the Cilium
migration (heavy eBPF usage) could also be a factor. However, we have
not found similar reports from Cilium users on other hardware.
We have not yet tested with kernel 6.14 to isolate the cause.
SYMPTOMS
========
1. Complete network loss: System becomes completely unreachable over network
2. "No route to host" errors from other machines trying to connect
3. PHY link stays UP: No "Link is Down" messages in kernel logs
4. No ethtool errors: ethtool -S eth0 shows zero errors/drops
5. System continues running: Kernel continues logging, journald works, local
processes run
6. No recovery: Network does not recover on its own, even after hours
7. Power cycle required: Only pulling power and restarting recovers the system
8. Affects multiple nodes: Observed on two separate RPi5 units running as K3s
cluster nodes
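If console access is available during a hang, a few local checks (illustrative
only; eth0 is the interface name from the logs below, and the gateway address
is a placeholder) help distinguish "link up but no traffic" from a link-layer
failure:

  ip -s link show eth0            # carrier state and interface-level packet/error counters
  ethtool eth0                    # negotiated speed/duplex and "Link detected"
  ethtool -S eth0 | grep -vw 0    # driver statistics with zero counters filtered out
  ping -c 3 -I eth0 <gateway-ip>  # placeholder; substitute the real gateway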
TIMELINE OF INCIDENT (December 4, 2025)
=======================================
From affected node (k8s-cp-01 / 172.16.101.1):
10:20:27 - RCU stall detected (system already showing instability)
10:32:10 - Last TCP activity logged: "read tcp...i/o timeout"
10:32:26 - 16-second gap in logs (suspicious)
10:32:48 - NFS timeouts begin (symptom of network loss, not cause)
10:37:00 - RCU stalls resume
~11:30 - Power cycle applied (only way to recover)
From peer node (k8s-worker-01 / 172.16.101.2):
10:31:27 - Lease update timeout to master
10:31:29 - "http2: client connection lost"
10:31:39 - Server 172.16.101.1:6443 marked FAILED
10:31:40-10:32:05 - Flapping between FAILED/RECOVERING/PREFERRED
10:32:05 - "read tcp 172.16.101.2:53838->172.16.101.1:6443: i/o timeout"
10:32:18 - "dial tcp 172.16.101.1:6443: connect: no route to host"
KEY OBSERVATION: RCU STALLS
===========================
Before applying the workaround, the system exhibited frequent RCU
stalls:
Dec 04 10:20:27 k8s-cp-01 kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P10112 } 21 jiffies s: 1086353 root: 0x0/T
Dec 04 10:37:00 k8s-cp-01 kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P2984747 } 21 jiffies s: 1095093 root: 0x0/T
Dec 04 10:37:01 k8s-cp-01 kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P2984896 } 21 jiffies s: 1095137 root: 0x0/T
[... many more ...]
Statistics:
- Previous boot (Dec 2-4): 99 RCU stalls total
- Current boot with ondemand governor: 57 RCU stalls in 26 minutes (~2.2 per
minute)
- After switching to performance governor: 0 RCU stalls in 5+ minutes
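For reference, counts like the above can be reproduced from the journal with
something along these lines (standard journalctl/grep invocations; the pattern
matches the messages quoted above):

  journalctl -k -b -1 | grep -c 'rcu_preempt detected expedited stalls'  # previous boot
  journalctl -k -b 0 | grep -c 'rcu_preempt detected expedited stalls'   # current boot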
WORKAROUND BEING TESTED
=======================
Setting CPU governor to "performance" mode:
echo "performance" | sudo tee
/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Observed effects:
1. RCU stalls stopped immediately after applying
2. Locks CPU at maximum frequency (2.4GHz)
3. Prevents frequency scaling transitions
Note: This workaround was applied today (Dec 4, 2025) after the hang
incident. We have not yet confirmed whether it prevents future hangs -
only that it eliminates RCU stalls. Long-term stability testing is
ongoing.
This is a workaround, not a fix. It sacrifices power efficiency and may
increase thermal output. The root cause is not addressed.
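The sysfs write above does not persist across reboots. For longer-term
testing, a minimal sketch of a persistent variant (assuming systemd; the unit
name is ours, not part of any existing package) is:

  # /etc/systemd/system/cpufreq-performance.service
  [Unit]
  Description=Pin CPU frequency governor to performance (workaround under test)

  [Service]
  Type=oneshot
  ExecStart=/bin/sh -c 'for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$g"; done'

  [Install]
  WantedBy=multi-user.target

Enable with: sudo systemctl daemon-reload && sudo systemctl enable --now cpufreq-performance.service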
HYPOTHESIS
==========
The issue appears to be related to CPU frequency transitions under
network load:
1. RP1 southbridge connects to CPU via PCIe
2. Ethernet (Cadence GEM) is on RP1, not directly on SoC (unlike RPi4's
bcmgenet)
3. Network IRQs must traverse: PHY -> Cadence GEM -> RP1 -> PCIe -> CPU
4. During frequency transition (ondemand scaling up/down), CPU briefly pauses
5. If this pause occurs during critical IRQ handling window:
- DMA buffers may overflow or timeout
- RP1 firmware may enter undefined state
- macb driver doesn't detect the failure (PHY link stays UP)
6. Result: Complete network hang requiring power cycle
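One way to test steps 4-6 during a future hang (from the local console, since
the network is down) is to check whether the GEM interrupt counters still
advance; a rough sketch, assuming the IRQ line is labelled with eth0 or the
device name seen in dmesg (1f00100000.ethernet):

  grep -E 'eth0|ethernet' /proc/interrupts; sleep 5; grep -E 'eth0|ethernet' /proc/interrupts

Frozen counters would point at the RP1/PCIe/IRQ path; counters that keep
advancing would point higher up the stack (driver, eBPF datapath).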
Alternative hypothesis: Cilium eBPF programs attached to network
interfaces may have timing-sensitive code paths that don't tolerate IRQ
latency spikes caused by frequency transitions. This could explain why
the issue appeared after Cilium migration.
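To check the alternative hypothesis, the eBPF programs attached to the
interface can be listed during a hang (bpftool and tc are standard tools;
output format varies by version):

  sudo bpftool net show dev eth0        # XDP and tc/BPF programs attached to eth0
  sudo tc filter show dev eth0 ingress  # tc filters (including Cilium's bpf filters) on ingress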
WHAT DOESN'T CAUSE THIS
=======================
Ruled out during investigation:
- Physical layer: No link-down events in kernel logs; the other node on the same
  switch remained operational
- NFS: Already using soft mounts; NFS timeouts are a symptom, not the cause
- Memory pressure: Free memory available, no OOM events
- Thermal: No throttling messages
- Cable/switch: A power cycle alone (no cable or switch changes) restores the network
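Checks along these lines (illustrative commands; exact log messages vary)
support the list above:

  journalctl -k -b 0 | grep -iE 'link is down'             # link-down events
  free -m                                                   # memory headroom
  journalctl -k -b 0 | grep -iE 'out of memory|oom-kill'    # OOM events
  journalctl -k -b 0 | grep -iE 'thermal|throttl'           # thermal throttling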
DIFFERENCES FROM RELATED ISSUES
===============================
Aspect       | raspberrypi/linux #6855/#6898 | This Issue
-------------|-------------------------------|--------------------------------------
Kernel       | 6.12.x (raspi)                | 6.17.0 (Ubuntu mainline)
Symptom      | Packet loss/drops             | Complete network hang
Recovery     | Self-recovers                 | Power cycle only
RCU stalls   | Not reported                  | 2+ per minute
Trigger      | High traffic / EEE            | Unknown (CPU freq scaling suspected)
Workaround   | Disable EEE                   | Testing: lock CPU to max frequency
REPRODUCTION
============
1. Raspberry Pi 5 running Ubuntu 25.10 with kernel 6.17.0-1004-raspi
2. CPU governor set to "ondemand" (default)
3. Run Kubernetes cluster (K3s with etcd, NFS mounts)
4. Wait - hangs occur unpredictably; timing ranges from hours to days
Important notes:
- Hangs occur during LOW activity periods, not under load
- Network is NOT heavily loaded when hangs occur
- Attempted to trigger hang with load testing - unsuccessful
- Cannot reliably reproduce on demand
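Since the hang cannot be triggered on demand, one option is to leave a small
watchdog running that snapshots local state to disk the moment the network
stops responding (sketch only; GATEWAY and the output path are placeholders):

  #!/bin/sh
  # net-hang-watch.sh - poll the gateway; on failure, append local diagnostics to disk
  GATEWAY=172.16.101.254          # placeholder; set to the real gateway
  OUT=/var/log/net-hang-watch.log
  while true; do
      if ! ping -c 1 -W 2 "$GATEWAY" > /dev/null 2>&1; then
          {
              echo "=== $(date -Is) ==="
              ip -s link show eth0
              ethtool -S eth0
              grep -E 'eth0|ethernet' /proc/interrupts
              cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
          } >> "$OUT" 2>&1
      fi
      sleep 10
  done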
REQUESTED INFORMATION
=====================
1. Is the macb driver in 6.17 tested with RP1 under frequency scaling scenarios?
2. Are there known timing constraints for IRQ handling on RP1 that could be
violated during CPU frequency transitions?
3. Should there be explicit synchronization between cpufreq and macb/RP1
drivers?
SYSTEM LOGS
===========
dmesg (boot):
[ 0.793494] rp1 0002:01:00.0: bar0 len 0x4000, start 0x1f00410000, end 0x1f00413fff, flags, 0x40200
[ 0.793502] rp1 0002:01:00.0: bar1 len 0x400000, start 0x1f00000000, end 0x1f003fffff, flags, 0x40200
[ 0.793510] rp1 0002:01:00.0: enabling device (0000 -> 0002)
[ 0.794114] rp1 0002:01:00.0: chip_id 0x20001927
[ 0.812980] genirq: irq_chip rp1_irq_chip did not update eff. affinity mask of irq 167
[ 0.832084] macb 1f00100000.ethernet eth0: Cadence GEM rev 0x00070109 at 0x1f00100000 irq 115 (2c:cf:67:10:71:8d)
[ 5.103390] rp1-firmware rp1_firmware: RP1 Firmware version eb39cfd516f8c90628aa9d91f52370aade5d0a55
[ 6.403020] macb 1f00100000.ethernet eth0: PHY [1f00100000.ethernet-ffffffff:01] driver [Broadcom BCM54213PE] (irq=POLL)
[ 6.403738] macb 1f00100000.ethernet eth0: configuring for phy/rgmii-id link mode
[ 6.408809] macb 1f00100000.ethernet: gem-ptp-timer ptp clock registered.
[ 10.529995] macb 1f00100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control tx
ethtool -S eth0 (during hang - no errors visible):
tx_carrier_sense_errors: 0
rx_frame_check_sequence_errors: 0
rx_length_field_frame_errors: 0
rx_symbol_errors: 0
rx_alignment_errors: 0
rx_resource_errors: 0
q0_rx_dropped: 0
q0_tx_dropped: 0
REFERENCES
==========
- My investigation log: https://github.com/lexfrei/k8s/issues/526 (detailed
timeline and analysis)
- Related but different: https://github.com/raspberrypi/linux/issues/6855 (EEE
packet loss)
- Related but different: https://github.com/raspberrypi/linux/issues/6898 (RX
drops not in ethtool)
- RCU stall on RPi4: https://github.com/raspberrypi/linux/issues/5665
- RP1 macb support commit:
https://github.com/raspberrypi/linux/commit/f9f0024bd9bf04a58b64bae356be4c04022d23bc
** Affects: linux-raspi (Ubuntu)
Importance: Undecided
Status: New