Alright: since this problem affects not only keepalived but any cluster-like software dealing with aliases on an existing interface (whether or not that interface is managed by systemd), I have tested the same test case on a pacemaker-based cluster with 3 nodes, one virtual IP, and a lighttpd instance running in the same resource group:
----
(k)inaddy@kcluster01:~$ crm config show
node 1: kcluster01
node 2: kcluster02
node 3: kcluster03
primitive fence_kcluster01 stonith:fence_virsh \
    params ipaddr=192.168.100.205 plug=kcluster01 action=off login=stonithmgr passwd=xxxx use_sudo=true delay=2 \
    op monitor interval=60s
primitive fence_kcluster02 stonith:fence_virsh \
    params ipaddr=192.168.100.205 plug=kcluster02 action=off login=stonithmgr passwd=xxxx use_sudo=true delay=4 \
    op monitor interval=60s
primitive fence_kcluster03 stonith:fence_virsh \
    params ipaddr=192.168.100.205 plug=kcluster03 action=off login=stonithmgr passwd=xxxx use_sudo=true delay=6 \
    op monitor interval=60s
primitive virtual_ip IPaddr2 \
    params ip=10.0.3.1 nic=eth3 \
    op monitor interval=10s
primitive webserver systemd:lighttpd \
    op monitor interval=10 timeout=60
group webserver_virtual_ip webserver virtual_ip
location l_fence_kcluster01 fence_kcluster01 -inf: kcluster01
location l_fence_kcluster02 fence_kcluster02 -inf: kcluster02
location l_fence_kcluster03 fence_kcluster03 -inf: kcluster03
property cib-bootstrap-options: \
    have-watchdog=true \
    dc-version=2.0.1-9e909a5bdd \
    cluster-infrastructure=corosync \
    cluster-name=debian \
    stonith-enabled=true \
    stonith-action=off \
    no-quorum-policy=stop
----
(k)inaddy@kcluster01:~$ cat /etc/netplan/cluster.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    eth1:
      dhcp4: no
      dhcp6: no
      addresses: [10.0.1.2/24]
    eth2:
      dhcp4: no
      dhcp6: no
      addresses: [10.0.2.2/24]
    eth3:
      dhcp4: no
      dhcp6: no
      addresses: [10.0.3.2/24]
    eth4:
      dhcp4: no
      dhcp6: no
      addresses: [10.0.4.2/24]
    eth5:
      dhcp4: no
      dhcp6: no
      addresses: [10.0.5.2/24]
----
And the virtual IP failed right after netplan acted on the systemd-managed network interface:

(k)inaddy@kcluster03:~$ sudo netplan apply
(k)inaddy@kcluster03:~$ ping 10.0.3.1
PING 10.0.3.1 (10.0.3.1) 56(84) bytes of data.
From 10.0.3.4 icmp_seq=1 Destination Host Unreachable
From 10.0.3.4 icmp_seq=2 Destination Host Unreachable
From 10.0.3.4 icmp_seq=3 Destination Host Unreachable
From 10.0.3.4 icmp_seq=4 Destination Host Unreachable
From 10.0.3.4 icmp_seq=5 Destination Host Unreachable
From 10.0.3.4 icmp_seq=6 Destination Host Unreachable
64 bytes from 10.0.3.1: icmp_seq=7 ttl=64 time=0.088 ms
64 bytes from 10.0.3.1: icmp_seq=8 ttl=64 time=0.076 ms

--- 10.0.3.1 ping statistics ---
8 packets transmitted, 2 received, +6 errors, 75% packet loss, time 7128ms
rtt min/avg/max/mdev = 0.076/0.082/0.088/0.006 ms, pipe 4

This is just like what is explained in this bug's description. With that, the virtual_ip monitor in pacemaker realized the virtual IP was gone and restarted it on the same node:

----
(k)inaddy@kcluster01:~$ crm status
Stack: corosync
Current DC: kcluster01 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Wed Sep 25 13:11:05 2019
Last change: Wed Sep 25 12:49:56 2019 by root via cibadmin on kcluster01

3 nodes configured
5 resources configured

Online: [ kcluster01 kcluster02 kcluster03 ]

Full list of resources:

 fence_kcluster01 (stonith:fence_virsh): Started kcluster02
 fence_kcluster02 (stonith:fence_virsh): Started kcluster01
 fence_kcluster03 (stonith:fence_virsh): Started kcluster01
 Resource Group: webserver_virtual_ip
     webserver  (systemd:lighttpd): Started kcluster03
     virtual_ip (ocf::heartbeat:IPaddr2): FAILED kcluster03

Failed Resource Actions:
* virtual_ip_monitor_10000 on kcluster03 'not running' (7): call=100, status=complete, exitreason='', last-rc-change='Wed Sep 25 13:11:05 2019', queued=0ms, exec=0ms
----
(k)inaddy@kcluster01:~$ crm status
Stack: corosync
Current DC: kcluster01 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Wed Sep 25 13:11:07 2019
Last change: Wed Sep 25 12:49:56 2019 by root via cibadmin on kcluster01

3 nodes configured
5 resources configured

Online: [ kcluster01 kcluster02 kcluster03 ]

Full list of resources:

 fence_kcluster01 (stonith:fence_virsh): Started kcluster02
 fence_kcluster02 (stonith:fence_virsh): Started kcluster01
 fence_kcluster03 (stonith:fence_virsh): Started kcluster01
 Resource Group: webserver_virtual_ip
     webserver  (systemd:lighttpd): Started kcluster03
     virtual_ip (ocf::heartbeat:IPaddr2): Started kcluster03

Failed Resource Actions:
* virtual_ip_monitor_10000 on kcluster03 'not running' (7): call=100, status=complete, exitreason='', last-rc-change='Wed Sep 25 13:11:05 2019', queued=0ms, exec=0ms
----

And, if I want, I can query the fail count of that particular resource (virtual_ip) on that node, to check whether the cluster was close to migrating it to another node on the assumption that this was a real failure (and is it?):

(k)inaddy@kcluster01:~$ sudo crm_failcount --query -r virtual_ip -N kcluster03
scope=status  name=fail-count-virtual_ip value=5

So this resource has already failed 5 times on that node, and a "netplan apply" could have triggered a resource migration, for example.

----

For pacemaker, the issue is not *that big* if the cluster is configured correctly - with a resource monitor - as the cluster will always try to restart the virtual IP associated with the managed resource (lighttpd in my case). Nevertheless, resource migrations, and possible downtime, could happen in the event of multiple resource monitor failures.

I'll now check why keepalived can't simply re-establish the virtual IPs after such a failure, like pacemaker does, and whether systemd-networkd should be changed to leave aliases alone when a specific flag is set, or whether things are good the way they are.

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to systemd in Ubuntu.
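As an aside, the check pacemaker's monitor performs here is conceptually very simple, and the check/re-add loop that keepalived is currently missing can be sketched in a few lines of shell. This is only an illustration: the VIP and interface names come from my test setup above, and has_vip() is a hypothetical helper that parses "ip -o addr"-style output, so the check itself can be exercised without root or a real interface.

```shell
#!/bin/sh
# Minimal sketch of the check/re-add loop a VIP monitor performs.
# VIP and DEV match the pacemaker test above (assumption: /24 mask).
VIP="10.0.3.1"
DEV="eth3"

has_vip() {
    # $1: output in the style of "ip -o addr show dev $DEV"
    printf '%s\n' "$1" | grep -Fq " inet ${VIP}/"
}

# A real monitor would loop over the live interface, e.g.:
#   has_vip "$(ip -o addr show dev "$DEV")" || ip addr add "${VIP}/24" dev "$DEV"

# Exercising the check against captured before/after samples:
before="2: eth3    inet 10.0.3.2/24 brd 10.0.3.255 scope global eth3
2: eth3    inet 10.0.3.1/24 scope global secondary eth3"
after="2: eth3    inet 10.0.3.2/24 brd 10.0.3.255 scope global eth3"

has_vip "$before" && echo "before netplan apply: vip present"
has_vip "$after"  || echo "after netplan apply:  vip missing"
```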
https://bugs.launchpad.net/bugs/1815101

Title:
  [master] Restarting systemd-networkd breaks keepalived clusters

Status in netplan: Confirmed
Status in heartbeat package in Ubuntu: Triaged
Status in keepalived package in Ubuntu: In Progress
Status in systemd package in Ubuntu: In Progress
Status in heartbeat source package in Bionic: Triaged
Status in keepalived source package in Bionic: Confirmed
Status in systemd source package in Bionic: Confirmed
Status in heartbeat source package in Disco: Triaged
Status in keepalived source package in Disco: Confirmed
Status in systemd source package in Disco: Confirmed
Status in heartbeat source package in Eoan: Triaged
Status in keepalived source package in Eoan: In Progress
Status in systemd source package in Eoan: In Progress

Bug description:

  Configure netplan for interfaces, for example (a working config with IP addresses obfuscated):

  network:
    ethernets:
      eth0:
        addresses: [192.168.0.5/24]
        dhcp4: false
        nameservers:
          search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
          addresses: [10.22.11.1]
      eth2:
        addresses:
          - 12.13.14.18/29
          - 12.13.14.19/29
        gateway4: 12.13.14.17
        dhcp4: false
        nameservers:
          search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
          addresses: [10.22.11.1]
      eth3:
        addresses: [10.22.11.6/24]
        dhcp4: false
        nameservers:
          search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
          addresses: [10.22.11.1]
      eth4:
        addresses: [10.22.14.6/24]
        dhcp4: false
        nameservers:
          search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
          addresses: [10.22.11.1]
      eth7:
        addresses: [9.5.17.34/29]
        dhcp4: false
        optional: true
        nameservers:
          search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
          addresses: [10.22.11.1]
    version: 2

  Configure keepalived (again, a working config with IP addresses obfuscated):

  global_defs                     # Block id
  {
    notification_email {
      sysadm...@blah.com
    }
    notification_email_from keepali...@system3.hq.blah.com
    smtp_server 10.22.11.7        # IP
    smtp_connect_timeout 30       # integer, seconds
    router_id system3             # string identifying the machine,
                                  # (doesn't have to be hostname).
    vrrp_mcast_group4 224.0.0.18  # optional, default 224.0.0.18
    vrrp_mcast_group6 ff02::12    # optional, default ff02::12
    enable_traps                  # enable SNMP traps
  }

  vrrp_sync_group collection {
    group {
      wan
      lan
      phone
    }
  }

  vrrp_instance wan {
    state MASTER
    interface eth2
    virtual_router_id 77
    priority 150
    advert_int 1
    smtp_alert
    authentication {
      auth_type PASS
      auth_pass BlahBlah
    }
    virtual_ipaddress {
      12.13.14.20
    }
  }

  vrrp_instance lan {
    state MASTER
    interface eth3
    virtual_router_id 78
    priority 150
    advert_int 1
    smtp_alert
    authentication {
      auth_type PASS
      auth_pass MoreBlah
    }
    virtual_ipaddress {
      10.22.11.13/24
    }
  }

  vrrp_instance phone {
    state MASTER
    interface eth4
    virtual_router_id 79
    priority 150
    advert_int 1
    smtp_alert
    authentication {
      auth_type PASS
      auth_pass MostBlah
    }
    virtual_ipaddress {
      10.22.14.3/24
    }
  }

  At boot the affected interfaces have:

  5: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      link/ether ab:cd:ef:90:c0:e3 brd ff:ff:ff:ff:ff:ff
      inet 10.22.14.6/24 brd 10.22.14.255 scope global eth4
         valid_lft forever preferred_lft forever
      inet 10.22.14.3/24 scope global secondary eth4
         valid_lft forever preferred_lft forever
      inet6 fe80::ae1f:6bff:fe90:c0e3/64 scope link
         valid_lft forever preferred_lft forever
  7: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      link/ether ab:cd:ef:b0:26:29 brd ff:ff:ff:ff:ff:ff
      inet 10.22.11.6/24 brd 10.22.11.255 scope global eth3
         valid_lft forever preferred_lft forever
      inet 10.22.11.13/24 scope global secondary eth3
         valid_lft forever preferred_lft forever
      inet6 fe80::ae1f:6bff:feb0:2629/64 scope link
         valid_lft forever preferred_lft forever
  9: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      link/ether ab:cd:ef:b0:26:2b brd ff:ff:ff:ff:ff:ff
      inet 12.13.14.18/29 brd 12.13.14.23 scope global eth2
         valid_lft forever preferred_lft forever
      inet 12.13.14.20/32 scope global eth2
         valid_lft forever preferred_lft forever
      inet 12.33.89.19/29 brd 12.13.14.23 scope global secondary eth2
         valid_lft forever preferred_lft forever
      inet6 fe80::ae1f:6bff:feb0:262b/64 scope link
         valid_lft forever preferred_lft forever

  Run 'netplan try' (without even making any changes to the configuration) and the keepalived addresses disappear, never to return; the affected interfaces then have:

  5: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      link/ether ab:cd:ef:90:c0:e3 brd ff:ff:ff:ff:ff:ff
      inet 10.22.14.6/24 brd 10.22.14.255 scope global eth4
         valid_lft forever preferred_lft forever
      inet6 fe80::ae1f:6bff:fe90:c0e3/64 scope link
         valid_lft forever preferred_lft forever
  7: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      link/ether ab:cd:ef:b0:26:29 brd ff:ff:ff:ff:ff:ff
      inet 10.22.11.6/24 brd 10.22.11.255 scope global eth3
         valid_lft forever preferred_lft forever
      inet6 fe80::ae1f:6bff:feb0:2629/64 scope link
         valid_lft forever preferred_lft forever
  9: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      link/ether ab:cd:ef:b0:26:2b brd ff:ff:ff:ff:ff:ff
      inet 12.13.14.18/29 brd 12.13.14.23 scope global eth2
         valid_lft forever preferred_lft forever
      inet 12.33.89.19/29 brd 12.13.14.23 scope global secondary eth2
         valid_lft forever preferred_lft forever
      inet6 fe80::ae1f:6bff:feb0:262b/64 scope link
         valid_lft forever preferred_lft forever

To manage notifications about this bug go to:
https://bugs.launchpad.net/netplan/+bug/1815101/+subscriptions

-- 
Mailing list: https://launchpad.net/~touch-packages
Post to     : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp
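P.S.: regarding the "specific flag" idea, newer systemd-networkd did eventually grow exactly this kind of knob: KeepConfiguration= in the [Network] section of a .network file (added around systemd v243; treat the exact version, the drop-in file name below, and whether it fully protects addresses added externally by keepalived as assumptions to verify). A sketch of such a drop-in, assuming netplan's generated unit for eth3 is named 10-netplan-eth3.network:

```ini
# /etc/systemd/network/10-netplan-eth3.network.d/override.conf
# (file name is an assumption -- check the unit netplan actually
#  generates under /run/systemd/network for your interface)
[Network]
# Ask networkd not to drop existing addresses and routes when the
# interface is reconfigured or the daemon is restarted.
KeepConfiguration=yes
```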