Public bug reported:
# Our problem
We are running multiple K8S clusters on Ubuntu 24.04.1 LTS nodes.
On one of these clusters, we have noticed at least twice that most of the nodes
(~5 out of 8) went offline without any action on our side.
To restore connectivity, we tried ifdown/ifup, disconnecting and reconnecting
the network from the hypervisor, and restarting the networking service, but
nothing helped; we had to reboot the nodes from the console.
After some investigation, we were able to correlate this outage with the
`apt-daily-upgrade` service run triggered by the `apt-daily-upgrade` timer.
Somehow, the `apt-daily-upgrade` service updated a package which triggered a
`systemctl daemon-reexec`, cutting network connectivity in the process.
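One way to narrow down which package upgrade preceded the re-execution is to look at the APT transaction log around the time of the outage. The following is a minimal Python sketch, assuming the usual `/var/log/apt/history.log` layout (blank-line-separated blocks with `Start-Date:` and `Upgrade:` fields). The function names, the 10-minute window, and the sample entry (`example-pkg`) are hypothetical and not taken from the affected nodes.

```python
import re
from datetime import datetime, timedelta


def parse_apt_history(text):
    """Parse APT history text into a list of {'start', 'upgraded'} dicts."""
    transactions = []
    # Transactions are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(": ")
            if sep:
                fields[key] = value.strip()
        if "Start-Date" not in fields:
            continue
        # Collapse the whitespace between date and time before parsing.
        stamp = " ".join(fields["Start-Date"].split())
        start = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        # Entries look like "name:arch (old, new)"; keep only the package name.
        upgraded = re.findall(r"([^\s,:()]+):\w+ \(", fields.get("Upgrade", ""))
        transactions.append({"start": start, "upgraded": upgraded})
    return transactions


def upgrades_near(text, when, window=timedelta(minutes=10)):
    """Return the transactions that started within `window` of `when`."""
    return [t for t in parse_apt_history(text) if abs(t["start"] - when) <= window]


if __name__ == "__main__":
    # Hypothetical sample entry, for illustration only.
    sample = (
        "Start-Date: 2025-02-21  06:06:40\n"
        "Commandline: /usr/bin/unattended-upgrade\n"
        "Upgrade: example-pkg:amd64 (1.0-1, 1.0-2)\n"
        "End-Date: 2025-02-21  06:06:58\n"
    )
    for t in upgrades_near(sample, datetime(2025, 2, 21, 6, 6, 55)):
        print(t["start"], t["upgraded"])
```

On a real node, the text would come from reading `/var/log/apt/history.log` (and its rotated copies), with `when` set to the timestamp of the `Reexecuting requested` journal entry.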
# Symptoms
The node is flagged as `NotReady` by K8s.
SSH connections to the node fail.
From the node, we cannot ping the gateway.
The output of `systemctl daemon-reexec` in `journalctl` is far more verbose
than usual:
```
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Reexecuting requested from client
PID 2711048 ('systemctl') (unit apt-daily-upgrade.service)...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Reexecuting.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: systemd 255.4-1ubuntu8.5 running
in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT
-GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD
+LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4
+XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP
+SYSVINIT default-hierarchy=unified)
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Detected virtualization vmware.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Detected architecture x86-64.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Starting man-db.service - Daily
man-db regeneration...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping containerd.service -
containerd container runtime...
févr. 21 06:06:55 lylux0634kdp004 ntpd[1106]: ERR: ntpd exiting on signal 15
(Terminated)
févr. 21 06:06:55 lylux0634kdp004 ntpd[1106]: PROTO: 172.16.10.254 unlink local
addr 172.16.34.4 -> <null>
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping ntpsec.service - Network
Time Service...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping open-vm-tools.service -
Service for virtual machines hosted on VMware...
févr. 21 06:06:55 lylux0634kdp004 systemd-journald[504]: Journal stopped
févr. 21 06:06:55 lylux0634kdp004 systemd-journald[504]: Received SIGTERM from
PID 1 (systemd).
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping systemd-journald.service
- Journal Service...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: ntpsec.service: Deactivated
successfully.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopped ntpsec.service - Network
Time Service.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: ntpsec.service: Consumed 1min
12.819s CPU time, 12.4M memory peak, 0B memory swap peak.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Deactivated
successfully.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process
3374 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process
3375 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process
3475 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process
3512 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process
3545 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process
3618 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process
2574706 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopped containerd.service -
containerd container runtime.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Consumed 9min
54.298s CPU time, 3.4G memory peak, 0B memory swap peak.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found
left-over process 3374 (containerd-shim) in control group while starting unit.
Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually
indicates unclean termination of a previous run, or service implementation
deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found
left-over process 3375 (containerd-shim) in control group while starting unit.
Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually
indicates unclean termination of a previous run, or service implementation
deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found
left-over process 3475 (containerd-shim) in control group while starting unit.
Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually
indicates unclean termination of a previous run, or service implementation
deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found
left-over process 3512 (containerd-shim) in control group while starting unit.
Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually
indicates unclean termination of a previous run, or service implementation
deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found
left-over process 3545 (containerd-shim) in control group while starting unit.
Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually
indicates unclean termination of a previous run, or service implementation
deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found
left-over process 3618 (containerd-shim) in control group while starting unit.
Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually
indicates unclean termination of a previous run, or service implementation
deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found
left-over process 2574706 (containerd-shim) in control group while starting
unit. Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually
indicates unclean termination of a previous run, or service implementation
deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Starting containerd.service -
containerd container runtime...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: netplan-ovs-cleanup.service -
OpenVSwitch configuration for cleanup was skipped because of an unmet condition
check (ConditionFileIsExecutable=/usr/bin/ovs-vsctl).
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Starting ntpsec.service - Network
Time Service...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]:
systemd-networkd-wait-online.service: Deactivated successfully.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopped
systemd-networkd-wait-online.service - Wait for Network to be Configured.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping
systemd-networkd-wait-online.service - Wait for Network to be Configured...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping systemd-networkd.service
- Network Configuration...
```
The `Found left-over process` lines reminded me of bug
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/2013543 but, as far as
I understand, Noble hosts should not be affected by it.
# Testcase
Here is the catch: we can't reproduce the issue on demand.
When running `systemctl daemon-reexec` manually, we do not experience
the same outage, and journalctl logs only 5 lines:
```
févr. 21 11:01:06 lylux0634kdp004 systemd[1]: Reexecuting requested from client
PID 23296 ('systemctl') (unit session-2.scope)...
févr. 21 11:01:06 lylux0634kdp004 systemd[1]: Reexecuting.
févr. 21 11:01:06 lylux0634kdp004 systemd[1]: systemd 255.4-1ubuntu8.5 running
in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT
-GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD
+LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT >
févr. 21 11:01:06 lylux0634kdp004 systemd[1]: Detected virtualization vmware.
févr. 21 11:01:06 lylux0634kdp004 systemd[1]: Detected architecture x86-64.
```
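To compare the two journal excerpts above more systematically, the following Python sketch extracts the unit that requested the re-execution and the units systemd started stopping afterwards. It assumes each journal entry is on a single, unwrapped line (for example as printed by `journalctl --no-pager`); the function name and regular expressions are illustrative, not part of any systemd tooling.

```python
import re

# Matches the message systemd logs when a client asks for a re-execution.
REEXEC_RE = re.compile(
    r"Reexecuting requested from client PID (\d+) \('([^']+)'\) \(unit ([^)]+)\)"
)
# Matches "Stopping <unit> - <description>..." messages logged by PID 1.
STOPPING_RE = re.compile(r"systemd\[1\]: Stopping (\S+)")


def summarize_reexec(lines):
    """Return the requesting unit and the units stopped after the request."""
    summary = {"requesting_unit": None, "stopped_units": []}
    for line in lines:
        match = REEXEC_RE.search(line)
        if match:
            summary["requesting_unit"] = match.group(3)
            continue
        match = STOPPING_RE.search(line)
        # Only count stops that occur after a re-execution request was seen.
        if match and summary["requesting_unit"] is not None:
            unit = match.group(1)
            if unit not in summary["stopped_units"]:
                summary["stopped_units"].append(unit)
    return summary


if __name__ == "__main__":
    outage = [
        "févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Reexecuting requested from client PID 2711048 ('systemctl') (unit apt-daily-upgrade.service)...",
        "févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping containerd.service - containerd container runtime...",
        "févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping systemd-networkd.service - Network Configuration...",
    ]
    print(summarize_reexec(outage))
```

Run against the manual test excerpt, the same function should report `session-2.scope` as the requesting unit and an empty list of stopped units, which makes the difference between the two runs explicit.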
# Some additional details
root@lylux0634kdp004:~# lsb_release -d
No LSB modules are available.
Description: Ubuntu 24.04.1 LTS
root@lylux0634kdp004:~# apt-cache policy systemd
systemd:
Installé : 255.4-1ubuntu8.5
Candidat : 255.4-1ubuntu8.5
Table de version :
*** 255.4-1ubuntu8.5 500
500 https://XXXXXX/ubuntu-fr noble-updates/main amd64 Packages
100 /var/lib/dpkg/status
255.4-1ubuntu8 500
500 https://XXXXX/ubuntu-fr noble/main amd64 Packages
root@lylux0634kdp004:~# uname -a
Linux lylux0634kdp004 6.8.0-52-generic #53-Ubuntu SMP PREEMPT_DYNAMIC Sat Jan
11 00:06:25 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Feel free to request any additional details that would help in
troubleshooting this issue.
Antoine
** Affects: systemd (Ubuntu)
Importance: Undecided
Status: New
https://bugs.launchpad.net/bugs/2099676
Title:
Network connectivity loss after systemctl daemon-reexec