Reviewed:  https://review.openstack.org/241517
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2d1b53bcfa6c4d6fa5bca2ba4da9aaca66245a5b
Submitter: Jenkins
Branch:    master

commit 2d1b53bcfa6c4d6fa5bca2ba4da9aaca66245a5b
Author: Hong Hui Xiao <[email protected]>
Date:   Wed Nov 4 01:44:43 2015 -0500

    Kill the vrrp orphan process when (re)spawn keepalived

    When keepalived crashes unexpectedly, the vrrp process associated
    with it becomes an orphan process. This leaves the VIP unable to
    migrate to the router on the same host. In addition, neutron code
    is not able to respawn the keepalived process, because keepalived
    believes it is still running, according to [1-3]. As a result,
    neutron reports respawning keepalived all the time. Restarting the
    l3-agent does not help.

    This patch checks for an orphan vrrp process, and deletes it if
    there is one, in the processmonitor of the l3 agent. More details
    can be found in the bug description and comments.

    [1] https://goo.gl/W3GL9I
    [2] https://goo.gl/F0Ixfb
    [3] https://goo.gl/dUqhTo

    Change-Id: Ia1759ed1365b845d404686a8cd25f882cce35caf
    Closes-Bug: #1511311

** Changed in: neutron
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1511311

Title:
  L3 agent failed to respawn keepalived process

Status in neutron:
  Fix Released

Bug description:
  I enabled L3 HA in the neutron configuration, and I often see the
  following in l3_agent.log:

  2015-10-14 22:30:16.397 21460 ERROR neutron.agent.linux.external_process [-] default-service for router with uuid 59de181e-8f02-470d-80f6-cb9f0d46f78b not found. The process should not have died
  2015-10-14 22:30:16.397 21460 ERROR neutron.agent.linux.external_process [-] respawning keepalived for uuid 59de181e-8f02-470d-80f6-cb9f0d46f78b
  2015-10-14 22:30:16.397 21460 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/ha_confs/59de181e-8f02-470d-80f6-cb9f0d46f78b.pid get_value_from_file /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:222
  2015-10-14 22:30:16.398 21460 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-59de181e-8f02-470d-80f6-cb9f0d46f78b', 'keepalived', '-P', '-f', '/var/lib/neutron/ha_confs/59de181e-8f02-470d-80f6-cb9f0d46f78b/keepalived.conf', '-p', '/var/lib/neutron/ha_confs/59de181e-8f02-470d-80f6-cb9f0d46f78b.pid', '-r', '/var/lib/neutron/ha_confs/59de181e-8f02-470d-80f6-cb9f0d46f78b.pid-vrrp'] create_process /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:84

  I also noticed that there are usually more vrrp pid files than "pid"
  files:

  root@neutron2:~# ls /var/lib/neutron/ha_confs/ | grep pid | grep -v vrrp | wc -l
  664
  root@neutron2:~# ls /var/lib/neutron/ha_confs/ | grep vrrp | wc -l
  677

  It seems that if the ".pid-vrrp" file exists, keepalived cannot be
  respawned successfully with a command like this:

  keepalived -P -f /var/lib/neutron/ha_confs/cb01b1de-fa6c-461e-ba39-4d506dfdfccb/keepalived.conf -p /var/lib/neutron/ha_confs/cb01b1de-fa6c-461e-ba39-4d506dfdfccb.pid -r /var/lib/neutron/ha_confs/cb01b1de-fa6c-461e-ba39-4d506dfdfccb.pid-vrrp

  So in neutron, after we have checked that the pid is not active,
  could we check for the "pid" file and the "vrrp pid" file and remove
  them before respawning keepalived, to make sure the process can
  start successfully?
  https://github.com/openstack/neutron/blob/master/neutron/agent/linux/external_process.py#L91-L92
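
The check proposed in the bug description (confirm the recorded pid is not active, then remove the "pid" and "vrrp pid" files before respawning) could be sketched roughly as follows. This is not neutron's code and not the committed patch, which kills the orphan vrrp process instead. The helper names read_pid, pid_is_alive and remove_stale_pid_files are hypothetical, and the only detail taken from the report is that the vrrp pid file is the main pid file path with "-vrrp" appended.

```python
import errno
import os


def read_pid(pid_file):
    """Return the positive integer PID stored in pid_file, or None."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed pid file.
        return None
    return pid if pid > 0 else None


def pid_is_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is delivered
    except OSError as exc:
        # EPERM means the process exists but belongs to another user.
        return exc.errno == errno.EPERM
    return True


def remove_stale_pid_files(pid_file):
    """Remove pid_file and its '-vrrp' sibling if their process is gone.

    Hypothetical helper: a file is treated as stale when it holds no
    valid PID or the PID does not belong to a running process.
    Returns the list of paths that were removed.
    """
    removed = []
    for path in (pid_file, pid_file + "-vrrp"):
        if not os.path.exists(path):
            continue
        pid = read_pid(path)
        if pid is None or not pid_is_alive(pid):
            os.remove(path)
            removed.append(path)
    return removed
```

A caller would run remove_stale_pid_files() on the router's .pid path just before launching keepalived. Note that this only covers the case where the vrrp process is already gone. If the vrrp child is still running as an orphan, its pid file is left alone, which is the case the committed patch handles by killing that process.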

