Good morning. I’m curious whether anyone is successfully running OpenShift in an environment where they manage their own DHCP clients and scopes. Our infrastructure recently had an issue and we are struggling to find the root cause. In our environment we run two vip-manager pods, which manage two IP addresses.
One of our suspicions is that keepalived doesn’t play nicely with DHCP. For example, if the DHCP client dies or renews its IP address, the vip-manager pod recognizes the event: it logs that the VIP it is managing, as well as the IP assigned to the node, has been removed. However, keepalived continues to send out VRRP advertisements as if it were still MASTER for that IP. This puts us in a bad spot, because the BACKUP keepalived never takes the IP address over and the IP is no longer assigned to anything.

Here is example log output from the pod on which I forced this failure:

10.0.0.1 == address assigned to node via DHCP
10.0.0.2 == address assigned to vip_manager_VIP_1
10.0.0.3 == address assigned to vip_manager_VIP_2
10.1.4.1 == lbr0/tun0

  - Loading ip_vs module ...
  - Checking if ip_vs module is available ...
ip_vs 140944 0
  - Module ip_vs is loaded.
  - Generating and writing config to /etc/keepalived/keepalived.conf
  - Starting failover services ...
Starting Healthcheck child process, pid=136
Initializing ipvs 2.6
Starting VRRP child process, pid=137
Netlink reflector reports IP 10.0.0.1 added
Netlink reflector reports IP 10.0.0.1 added
Netlink reflector reports IP 10.1.4.1 added
Netlink reflector reports IP 10.1.4.1 added
Netlink reflector reports IP 10.1.4.1 added
Netlink reflector reports IP 10.1.4.1 added
Registering Kernel netlink reflector
Registering Kernel netlink reflector
Registering Kernel netlink command channel
Registering Kernel netlink command channel
Registering gratuitous ARP shared channel
Opening file '/etc/keepalived/keepalived.conf'.
Opening file '/etc/keepalived/keepalived.conf'.
Configuration is using : 8733 Bytes
Truncating auth_pass to 8 characters
Truncating auth_pass to 8 characters
Configuration is using : 73522 Bytes
Using LinkWatch kernel netlink reflector...
VRRP_Instance(vip_manager_VIP_1) Entering BACKUP STATE
VRRP sockpool: [ifindex(2), proto(112), unicast(0), fd(9,10)]
VRRP_Instance(vip_manager_VIP_2) Transition to MASTER STATE
VRRP_Instance(vip_manager_VIP_2) Entering FAULT STATE
VRRP_Script(chk_vip_manager) succeeded
VRRP_Instance(vip_manager_VIP_2) prio is higher than received advert
VRRP_Instance(vip_manager_VIP_2) Transition to MASTER STATE
VRRP_Instance(vip_manager_VIP_2) Received lower prio advert, forcing new election
VRRP_Instance(vip_manager_VIP_2) Entering MASTER STATE
VRRP_Instance(vip_manager_VIP_2) setting protocol VIPs.
Netlink reflector reports IP 10.0.0.3 added
VRRP_Instance(vip_manager_VIP_2) Sending gratuitous ARPs on eno16780032 for 10.0.0.3
VRRP_Instance(vip_manager_VIP_2) Sending gratuitous ARPs on eno16780032 for 10.0.0.3

...<dhclient renews the ip address>...

Netlink reflector reports IP 10.0.0.1 removed
Netlink reflector reports IP 10.0.0.1 removed
Netlink reflector reports IP 10.0.0.3 removed
Netlink reflector reports IP 10.0.0.3 removed
Netlink reflector reports IP 10.0.0.1 added
Netlink reflector reports IP 10.0.0.1 added

At this point the other vip-manager pod is still receiving VRRP advertisements for 10.0.0.3 and therefore never takes over the IP address, so effectively half of our traffic (depending on DNS round-robin) is lost. Our recovery options are to restart the network, which stops the VRRP packets long enough to cause a failover, or to restart the affected pod.

The version of keepalived provided by RHEL is 10 minor revisions behind; I’m curious whether there may be a benefit to getting this package updated. Pending any advice, my next troubleshooting step would be to build my own version of the vip-manager with an upgraded keepalived to see whether this issue persists.

--
John Skarbek
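One possible mitigation for the behavior described above is a keepalived track script that forces the instance into FAULT state when the node believes it is MASTER but its VIP has vanished (e.g. removed by a dhclient renew), so the backup can win the next election. A plain VIP check alone would also fail on the BACKUP node (which legitimately does not hold the VIP), so the sketch below combines a notify hook, which records the VRRP state, with a check that only enforces the VIP while MASTER. This is a sketch, not a tested fix: the VIP, interface, and state-file path are assumptions taken from the log excerpt, and the script would need to be wired into keepalived.conf via `notify` and `vrrp_script`/`track_script`.

```shell
#!/bin/sh
# Hypothetical keepalived helper; VIP, interface, and paths are
# assumptions based on the log excerpt above.
STATE_FILE="${STATE_FILE:-/tmp/vip_manager_VIP_2.state}"  # /var/run in production
VIP="${VIP:-10.0.0.3}"
IFACE="${IFACE:-eno16780032}"

# Notify hook: keepalived calls it with TYPE NAME STATE; we record STATE.
record_state() {
  echo "$3" > "$STATE_FILE"
}

# True if the VIP is currently assigned to the interface.
vip_present() {
  ip -4 addr show dev "$IFACE" 2>/dev/null | grep -qw "$VIP"
}

# Track-script check: fail only when we think we are MASTER but the
# VIP is gone; a BACKUP never holds the VIP, so it always passes.
check_vip() {
  state=$(cat "$STATE_FILE" 2>/dev/null || echo BACKUP)
  if [ "$state" = "MASTER" ] && ! vip_present; then
    return 1   # MASTER without its VIP: force FAULT, let the backup take over
  fi
  return 0
}

case "$1" in
  notify) shift; record_state "$@" ;;
  check)  check_vip; exit $? ;;
esac
```

In keepalived.conf this would be referenced roughly as `notify "/path/to/script.sh notify"` inside the vrrp_instance block, plus a `vrrp_script` running `/path/to/script.sh check` tracked via `track_script` — again, as a sketch to experiment with rather than a known-good configuration.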
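A blunter workaround, if the root cause really is that keepalived never notices the renew, would be to bounce the vip-manager pod from a dhclient exit hook so keepalived re-reads the node's addresses and a clean election runs. The sketch below assumes RHEL's dhclient-script, which sources /etc/dhcp/dhclient-exit-hooks with $reason set to the lease event; the restart command is an assumption and would need to be whatever restarts the pod in your environment.

```shell
#!/bin/sh
# Sketch of /etc/dhcp/dhclient-exit-hooks (sourced by RHEL's
# dhclient-script with $reason set). RESTART_CMD is an assumption;
# substitute whatever restarts the vip-manager pod for you.
RESTART_CMD="${RESTART_CMD:-docker restart vip-manager}"

handle_lease_event() {
  case "$1" in
    BOUND|RENEW|REBIND|REBOOT)
      # Lease changed: restart vip-manager so keepalived re-elects.
      logger -t dhclient-hook "lease event $1: restarting vip-manager" 2>/dev/null || :
      $RESTART_CMD >/dev/null 2>&1 || :
      return 0 ;;
    *)
      return 1 ;;   # not a lease-change event; do nothing
  esac
}

handle_lease_event "$reason" || :
```

The obvious cost is a brief VIP outage on every renew, so this is only a stopgap while the keepalived behavior is investigated.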
_______________________________________________
users mailing list
[email protected]
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
