If anyone has to stay on this kernel, like I do, or is just too
stubborn to downgrade, also like me, I've created a setup script that
enables the watchdog with timeouts that seem to work on my Pi 5 with
Ubuntu 25.10, which I assume is the same setup as yours if you are
here chasing this issue.

It pings a list of IP addresses (they should be on the same network but
not belong to the machine in question). Should ALL of them fail, it
unloads and reloads the macb driver module and checks again after 10
seconds. Should this not work, the system reboots in ~60 seconds as a
fallback.
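
As a rough sanity check on those timings: the worst case for one pass of
the test script is roughly (number of targets) x (ping -W timeout), and a
repair pass adds the 10 second sleep on top. Both need to stay under
test-timeout and repair-timeout (60 in the config further down). As far as
I can tell from the watchdog.conf man page, a script that overruns its
timeout is treated as failed. A minimal sketch of that arithmetic; `budget`
is a helper name made up for illustration, not part of the setup script:

```shell
# budget TARGETS PING_TIMEOUT SLEEP LIMIT
# Prints the approximate worst-case run time in seconds and whether it
# fits under LIMIT (all values in seconds).
budget() {
  awk -v n="$1" -v w="$2" -v s="$3" -v lim="$4" 'BEGIN {
    worst = n * w + s
    printf "%.1f %s\n", worst, (worst < lim ? "ok" : "too slow")
  }'
}

# Usage with the values from this post (5 targets, -W.1, 60 second limits):
#   budget 5 0.1 0 60     # test pass
#   budget 5 0.1 10 60    # repair pass, including the sleep
```

If you raise the -W timeout or add many more targets, rerun the numbers
before touching the watchdog config.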

The idea here is to detect the issue and attempt repair quickly to avoid
inter-node timeouts and degraded replicas that would have to be rebuilt.
However, to ensure robustness, the fallback reboot is still there if
required.

Watchdog uses a hardware timer that must be "petted" every so often or
it will reboot the system. Beware of boot loops and test
/etc/watchdog.d/ping-targets before starting/enabling the watchdog.
Should you get stuck in one, as I did when I set the watchdog timer too
low, spamming the node with this saved me:

ssh <node> sudo systemctl disable watchdog
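
To test the script by hand before the watchdog is started, something like
the following should do. As far as I know, watchdog runs the scripts in
/etc/watchdog.d with `test` as the first argument, and all this script
cares about is that the argument isn't `repair`. `try_script` is a made-up
helper name for illustration:

```shell
# try_script [PATH]
# Runs the test script once in test mode and reports what watchdog
# would conclude from its exit status.
try_script() {
  local script=${1:-/etc/watchdog.d/ping-targets}
  local status=0
  "${script}" test || status=$?
  if [ "${status}" -eq 0 ]; then
    echo "OK: at least one target answered"
  else
    echo "FAIL (exit ${status}): watchdog would attempt repair"
  fi
  return "${status}"
}

# Usage, from a root shell (the script appends to /var/log/ping-targets.log
# when a ping fails):
#   try_script
```

Do NOT pass `repair` by hand unless you are happy for the network driver
to be reloaded under you.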

Hope this helps

---

# Based on post in https://forums.raspberrypi.com/viewtopic.php?t=89527
# by Denny Fox

sudo apt-get install watchdog -y

sudo mkdir -p /etc/watchdog.d

# Write the IPs that currently answer ping and are not local to
# /etc/watchdog.d/targets
{
  host_ips=$(hostname -i)
  prefix=192.168.220.   # <--- Set prefix
  for i in 11 {25..29}  # <--- Set range of IPs, here: 11, 25, 26, 27, 28, 29
  do
    ip=${prefix}${i}
    # Skip our own address; -wF stops 192.168.220.2 matching 192.168.220.27
    grep -qwF "${ip}" <<<"${host_ips}" && continue
    ping -q -c1 -W2 "${ip}" &> /dev/null && echo "${ip}"
  done
} | sudo tee /etc/watchdog.d/targets

# ping script /etc/watchdog.d/ping-targets
cat <<"EOF" | sudo tee /etc/watchdog.d/ping-targets
#!/usr/bin/env bash

# A test/repair script for the raspberry pi watchdog

# This script only returns an error if *all* the hosts listed in the datafile
# do not respond to ping

LOGFILE=/var/log/ping-targets.log

log() {
  echo "$(date +'%Y%m%d %H:%M:%S') $*" >> "${LOGFILE}"
}

# Watchdog calls us again with arg1 = repair if we signal an error in test mode
if [ "${1}" == "repair" ]
then
  log "attempting repair"
  log "unloading macb module"
  modprobe -r macb
  log "reloading macb module"
  modprobe macb
  log "sleeping for 10 seconds"
  sleep 10
  log "confirming repair - pinging targets"
fi

# Try to ping each IP and exit with status 0 on the first success
while read -r ip || [[ -n ${ip} ]]
do
  # <--- Adjust the -W ping timeout (in seconds) if needed
  ping -q -c1 -W.1 "${ip}" &> /dev/null && exit 0
  log "${ip} ping failed"
done < /etc/watchdog.d/targets

log "all pings failed, exiting status 1"
exit 1

EOF

sudo chmod a+x /etc/watchdog.d/ping-targets

# Watchdog config
cat <<"EOF" | sudo tee /etc/watchdog.conf 
watchdog-device = /dev/watchdog
watchdog-timeout = 60
test-timeout = 60
repair-timeout = 60
interval = 2
retry-timeout = 0
realtime = yes
priority = 1
EOF

sudo systemctl start watchdog

# Recommend enabling the watchdog to survive reboots when you are happy this
# works for you
sudo systemctl enable watchdog

---
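
One thing worth checking before `systemctl start watchdog`: if
/etc/watchdog.d/targets ends up empty (for example because none of the
candidate IPs answered when the file was generated), the loop in
ping-targets never runs a ping, so the script always exits 1 and the node
goes round the repair/reboot cycle. A small guard, with `targets_ok` being
a name picked for illustration:

```shell
# targets_ok [FILE]
# Succeeds only if FILE contains at least one non-blank line.
targets_ok() {
  local file=${1:-/etc/watchdog.d/targets}
  if grep -q '[^[:space:]]' "${file}" 2>/dev/null; then
    echo "$(grep -c '[^[:space:]]' "${file}") target(s) listed in ${file}"
  else
    echo "no targets in ${file}, do not start the watchdog yet" >&2
    return 1
  fi
}

# Usage:
#   targets_ok && sudo systemctl start watchdog
```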

To monitor:

sudo tail -f /var/log/ping-targets.log -n 50

A successful repair for the node with IP 192.168.220.27 would look
something like this (note that ping successes are not recorded, to avoid
spamming the log):

20251222 12:07:08 192.168.220.11 ping failed
20251222 12:07:08 192.168.220.25 ping failed
20251222 12:07:08 192.168.220.26 ping failed
20251222 12:07:08 192.168.220.28 ping failed
20251222 12:07:08 192.168.220.29 ping failed
20251222 12:07:08 all pings failed, exiting status 1
20251222 12:07:10 attempting repair
20251222 12:07:10 unloading macb module
20251222 12:07:10 reloading macb module
20251222 12:07:10 sleeping for 10 seconds
20251222 12:07:20 confirming repair - pinging targets
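
To get a feel for how often this happens on a node, the log can be
summarised by counting repair attempts per day. A minimal sketch, assuming
the log format shown above (date in the first field); `repairs_per_day` is
a hypothetical helper name:

```shell
# repairs_per_day [LOGFILE]
# Prints one line per day: the date and the number of repair attempts.
repairs_per_day() {
  local logfile=${1:-/var/log/ping-targets.log}
  awk '/attempting repair/ { count[$1]++ }
       END { for (day in count) print day, count[day] }' "${logfile}" | sort
}

# Usage:
#   repairs_per_day
```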

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2133877

Title:
  Complete network hang on Raspberry Pi 5 with kernel 6.17 under load -
  possibly related to CPU frequency scaling
