[Bug 1815101] Re: [master] Restarting systemd-networkd breaks keepalived clusters
Alright,

As this is a problem that affects not only keepalived but all cluster-like software dealing with aliases on any existing interface, managed or not by systemd, I have tested the same test case in a Pacemaker-based cluster with 3 nodes, having 1 virtual IP + a lighttpd instance running in the same resource group:

(k)inaddy@kcluster01:~$ crm config show
node 1: kcluster01
node 2: kcluster02
node 3: kcluster03
primitive fence_kcluster01 stonith:fence_virsh \
        params ipaddr=192.168.100.205 plug=kcluster01 action=off login=stonithmgr passwd= use_sudo=true delay=2 \
        op monitor interval=60s
primitive fence_kcluster02 stonith:fence_virsh \
        params ipaddr=192.168.100.205 plug=kcluster02 action=off login=stonithmgr passwd= use_sudo=true delay=4 \
        op monitor interval=60s
primitive fence_kcluster03 stonith:fence_virsh \
        params ipaddr=192.168.100.205 plug=kcluster03 action=off login=stonithmgr passwd= use_sudo=true delay=6 \
        op monitor interval=60s
primitive virtual_ip IPaddr2 \
        params ip=10.0.3.1 nic=eth3 \
        op monitor interval=10s
primitive webserver systemd:lighttpd \
        op monitor interval=10 timeout=60
group webserver_virtual_ip webserver virtual_ip
location l_fence_kcluster01 fence_kcluster01 -inf: kcluster01
location l_fence_kcluster02 fence_kcluster02 -inf: kcluster02
location l_fence_kcluster03 fence_kcluster03 -inf: kcluster03
property cib-bootstrap-options: \
        have-watchdog=true \
        dc-version=2.0.1-9e909a5bdd \
        cluster-infrastructure=corosync \
        cluster-name=debian \
        stonith-enabled=true \
        stonith-action=off \
        no-quorum-policy=stop

(k)inaddy@kcluster01:~$ cat /etc/netplan/cluster.yaml
network:
    version: 2
    renderer: networkd
    ethernets:
        eth1:
            dhcp4: no
            dhcp6: no
            addresses: [10.0.1.2/24]
        eth2:
            dhcp4: no
            dhcp6: no
            addresses: [10.0.2.2/24]
        eth3:
            dhcp4: no
            dhcp6: no
            addresses: [10.0.3.2/24]
        eth4:
            dhcp4: no
            dhcp6: no
            addresses: [10.0.4.2/24]
        eth5:
            dhcp4: no
            dhcp6: no
            addresses: [10.0.5.2/24]

And the virtual IP failed right after netplan acted on the systemd-networkd interface:

(k)inaddy@kcluster03:~$ sudo netplan apply
(k)inaddy@kcluster03:~$ ping 10.0.3.1
PING 10.0.3.1 (10.0.3.1) 56(84) bytes of data.
From 10.0.3.4 icmp_seq=1 Destination Host Unreachable
From 10.0.3.4 icmp_seq=2 Destination Host Unreachable
From 10.0.3.4 icmp_seq=3 Destination Host Unreachable
From 10.0.3.4 icmp_seq=4 Destination Host Unreachable
From 10.0.3.4 icmp_seq=5 Destination Host Unreachable
From 10.0.3.4 icmp_seq=6 Destination Host Unreachable
64 bytes from 10.0.3.1: icmp_seq=7 ttl=64 time=0.088 ms
64 bytes from 10.0.3.1: icmp_seq=8 ttl=64 time=0.076 ms

--- 10.0.3.1 ping statistics ---
8 packets transmitted, 2 received, +6 errors, 75% packet loss, time 7128ms
rtt min/avg/max/mdev = 0.076/0.082/0.088/0.006 ms, pipe 4

As explained in this bug's description.

With that, virtual_ip_monitor, from Pacemaker, noticed that the virtual IP was gone and restarted it on the same node:

(k)inaddy@kcluster01:~$ crm status
Stack: corosync
Current DC: kcluster01 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Wed Sep 25 13:11:05 2019
Last change: Wed Sep 25 12:49:56 2019 by root via cibadmin on kcluster01

3 nodes configured
5 resources configured

Online: [ kcluster01 kcluster02 kcluster03 ]

Full list of resources:

 fence_kcluster01 (stonith:fence_virsh): Started kcluster02
 fence_kcluster02 (stonith:fence_virsh): Started kcluster01
 fence_kcluster03 (stonith:fence_virsh): Started kcluster01
 Resource Group: webserver_virtual_ip
     webserver (systemd:lighttpd): Started kcluster03
     virtual_ip (ocf::heartbeat:IPaddr2): FAILED kcluster03

Failed Resource Actions:
* virtual_ip_monitor_1 on kcluster03 'not running' (7): call=100, status=complete, exitreason='',
    last-rc-change='Wed Sep 25 13:11:05 2019', queued=0ms, exec=0ms

(k)inaddy@kcluster01:~$ crm status
Stack: corosync
Current DC: kcluster01 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Wed Sep 25 13:11:07 2019
Last change: Wed Sep 25 12:49:56 2019 by root via cibadmin on kcluster01

3 nodes configured
5 resources configured

Online: [ kcluster01 kcluster02 kcluster03 ]

Full list of resources:

 fence_kcluster01 (stonith:fence_virsh): Started kcluster02
 fence_kcluster02 (stonith:fence_virsh): Started kcluster01
 fence_kcluster03 (stonith:fence_virsh): Started kcluster01
 Resource Group: webserver_virtual_ip
     webserver (systemd:lighttpd): Started kcluster03
     virtual_ip (ocf::heartbeat:IPaddr2): Started kcluster03

Failed Resource Actions:
* virtual_ip_monitor_1
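For anyone wanting to watch for the same symptom outside of Pacemaker, here is a minimal sketch of the kind of check a monitor operation performs: look at the addresses on the NIC and see whether the virtual IP is still there. It is not the IPaddr2 agent's actual code. The function names and the two sample outputs are hypothetical, and the samples only assume the usual one-record-per-line layout of `ip -o -4 addr show`.

```python
import subprocess
from typing import Set


def addresses_from_ip_output(output: str) -> Set[str]:
    """Collect IPv4 addresses (without prefix length) from `ip -o -4 addr show` text."""
    found = set()
    for line in output.splitlines():
        tokens = line.split()
        # Each one-line record is expected to contain "inet <addr>/<prefix>".
        if "inet" in tokens:
            idx = tokens.index("inet")
            if idx + 1 < len(tokens):
                found.add(tokens[idx + 1].split("/")[0])
    return found


def vip_present(vip: str, output: str) -> bool:
    """Return True if the virtual IP appears among the parsed addresses."""
    return vip in addresses_from_ip_output(output)


def read_interface(nic: str) -> str:
    """Query iproute2 for one interface (needs the `ip` binary; not called below)."""
    result = subprocess.run(
        ["ip", "-o", "-4", "addr", "show", "dev", nic],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Hypothetical samples: eth3 with and without the 10.0.3.1 alias from this test case.
SAMPLE_WITH_VIP = "\n".join([
    "5: eth3    inet 10.0.3.4/24 brd 10.0.3.255 scope global eth3\\       valid_lft forever preferred_lft forever",
    "5: eth3    inet 10.0.3.1/24 brd 10.0.3.255 scope global secondary eth3\\       valid_lft forever preferred_lft forever",
])
SAMPLE_WITHOUT_VIP = (
    "5: eth3    inet 10.0.3.4/24 brd 10.0.3.255 scope global eth3\\       valid_lft forever preferred_lft forever"
)

if __name__ == "__main__":
    print("before netplan apply:", vip_present("10.0.3.1", SAMPLE_WITH_VIP))
    print("after netplan apply: ", vip_present("10.0.3.1", SAMPLE_WITHOUT_VIP))
```

On a real node, `vip_present("10.0.3.1", read_interface("eth3"))` run before and after `netplan apply` should show the alias disappearing, assuming the node behaves as in the transcript above.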
[Bug 1815101] Re: [master] Restarting systemd-networkd breaks keepalived clusters
** Also affects: heartbeat (Ubuntu)
   Importance: Undecided
       Status: New

** Changed in: heartbeat (Ubuntu Bionic)
   Importance: Undecided => Medium

** Changed in: heartbeat (Ubuntu Bionic)
       Status: New => Triaged

** Changed in: heartbeat (Ubuntu Disco)
   Importance: Undecided => Medium

** Changed in: heartbeat (Ubuntu Disco)
       Status: New => Triaged

** Changed in: heartbeat (Ubuntu Eoan)
   Importance: Undecided => Low

** Changed in: heartbeat (Ubuntu Eoan)
       Status: New => Triaged

** Changed in: heartbeat (Ubuntu Bionic)
     Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)

** Changed in: heartbeat (Ubuntu Disco)
     Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)

** Changed in: heartbeat (Ubuntu Eoan)
     Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1815101

Title:
  [master] Restarting systemd-networkd breaks keepalived clusters

To manage notifications about this bug go to:
https://bugs.launchpad.net/netplan/+bug/1815101/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1815101] Re: [master] Restarting systemd-networkd breaks keepalived clusters
Based on comment #12, and on comments from the duplicate cases, I'll summarize here, in a clearer and consolidated way, how to reproduce the issue, how to mitigate it using the dummy workaround, and how to fix it (with the backports/merge requests). At the end I might provide a PPA and ask for feedback.

** Changed in: netplan
       Status: Invalid => Confirmed

** Changed in: keepalived (Ubuntu)
       Status: Triaged => Confirmed

** Changed in: systemd (Ubuntu)
       Status: Triaged => Confirmed

** Also affects: keepalived (Ubuntu Eoan)
   Importance: Undecided
       Status: Confirmed

** Also affects: systemd (Ubuntu Eoan)
   Importance: Undecided
       Status: Confirmed

** Also affects: keepalived (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Also affects: systemd (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Also affects: keepalived (Ubuntu Disco)
   Importance: Undecided
       Status: New

** Also affects: systemd (Ubuntu Disco)
   Importance: Undecided
       Status: New

** Changed in: keepalived (Ubuntu Bionic)
       Status: New => Confirmed

** Changed in: keepalived (Ubuntu Disco)
       Status: New => Confirmed

** Changed in: systemd (Ubuntu Bionic)
       Status: New => Confirmed

** Changed in: systemd (Ubuntu Disco)
       Status: New => Confirmed

** Changed in: keepalived (Ubuntu Bionic)
   Importance: Undecided => Medium

** Changed in: keepalived (Ubuntu Disco)
   Importance: Undecided => Medium

** Changed in: keepalived (Ubuntu Eoan)
   Importance: Undecided => Medium

** Changed in: systemd (Ubuntu Bionic)
   Importance: Undecided => Medium

** Changed in: systemd (Ubuntu Disco)
   Importance: Undecided => Medium

** Changed in: systemd (Ubuntu Eoan)
   Importance: Undecided => Medium

** Changed in: keepalived (Ubuntu Bionic)
     Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)

** Changed in: keepalived (Ubuntu Disco)
     Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)

** Changed in: keepalived (Ubuntu Eoan)
     Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)

** Changed in: systemd (Ubuntu Bionic)
     Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)

** Changed in: systemd (Ubuntu Disco)
     Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)

** Changed in: systemd (Ubuntu Eoan)
     Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)

** Changed in: netplan
     Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)

** Changed in: systemd (Ubuntu Eoan)
       Status: Confirmed => In Progress

** Changed in: keepalived (Ubuntu Eoan)
       Status: Confirmed => In Progress
[Bug 1815101] Re: [master] Restarting systemd-networkd breaks keepalived clusters
The following 3 bugs:

https://bugs.launchpad.net/bugs/1815101
https://bugs.launchpad.net/bugs/1819074
https://bugs.launchpad.net/bugs/1810583

have the same root cause: systemd-networkd interferes with secondary IP addresses on NICs managed by systemd. I'm marking all other cases as duplicates of LP: #1815101.

TODO here is the following:

- There are mainly 2 "fixes" for this issue:

1) keepalived is able to recognize systemd-networkd changes and change the cluster status in order to reconfigure the managed NICs (keepalived (> 2.0.x)).

2) systemd-networkd implements a new setting (KeepConfiguration=) for its .network configuration files, which fixes this behavior not only for keepalived but for all HA-related software that manages secondary IPs and/or aliases on NICs managed by systemd-networkd.

I think the most appropriate course is to make sure both features work, together, in Eoan, and then make sure the SRUs are done for Disco and Bionic.

One problem with item (2) is that netplan will also have to support the new "KeepConfiguration=" setting, but fix (2) is the more appropriate one for all the other HA-related software controlling virtual IPs (CTDB, Pacemaker, and so on).
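To make item (2) more concrete, here is a small sketch that renders the text of a systemd-networkd drop-in enabling the option. It assumes a systemd new enough to ship KeepConfiguration= in the [Network] section of .network files. The function name is hypothetical, and the accepted values are limited to a subset I am confident about, so please check systemd.network(5) for the release you run.

```python
# Subset of values this sketch accepts; systemd may accept other spellings too.
ALLOWED_VALUES = ("yes", "no", "static", "dhcp", "dhcp-on-stop")


def render_keep_configuration(value: str = "static") -> str:
    """Return drop-in text asking systemd-networkd to keep existing configuration."""
    if value not in ALLOWED_VALUES:
        raise ValueError("unsupported KeepConfiguration value: {!r}".format(value))
    return "[Network]\nKeepConfiguration={}\n".format(value)


if __name__ == "__main__":
    # Print the drop-in so it can be reviewed before being written anywhere.
    print(render_keep_configuration("static"), end="")
```

Where the drop-in would live depends on how the .network file for the NIC is named; with netplan-generated files that naming is an assumption to verify on the node before relying on it.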
[Bug 1815101] Re: [master] Restarting systemd-networkd breaks keepalived clusters
The aforementioned link shows there's been work towards a fix in systemd. I can't say whether that suggests what can be done to improve keepalived, but I've tagged this "server-next" to get it on the Ubuntu Server Team's high-priority list, as per Robie's earlier comment.

** Tags added: server-next
[Bug 1815101] Re: [master] Restarting systemd-networkd breaks keepalived clusters
For reference: https://github.com/systemd/systemd/pull/12511
[Bug 1815101] Re: [master] Restarting systemd-networkd breaks keepalived clusters
It looks like there is some clear and actionable work in keepalived here (even if only as a workaround, with the real fix ending up in systemd), so I'm marking it as Triaged.

FTR, the Ubuntu Server Team is aware of this as a high-level issue, and it is high up in our list of priorities to determine how to address it properly.

** Changed in: keepalived (Ubuntu)
       Status: Incomplete => Triaged
[Bug 1815101] Re: [master] Restarting systemd-networkd breaks keepalived clusters
If I understand correctly the keepalived > 2.0.x behavior referred to by cdmiller above (see the 2019-03-07 comment), it is not the appropriate response to the problem. Granted, it mitigates the consequences, but it doesn't address the underlying issue. A problem originating in systemd should not cause a keepalived failover, since failover is designed to address system or hardware failure, not the bad behavior of other system software. systemd needs to be made to cooperate with other software rather than assuming it is the only authority on the system.