** Description changed:

- Latest Mitaka code, L3 HA
- After running rally create_and_delete_routers (concurrency 100 and times 100, or more) neutron l3 agent logs on nodes filled (every .003 second timestamp) with such traces:
- http://paste.openstack.org/show/599851/
- which causes cluster fall when log partition will filled up.
+ [Impact]
+
+ When deleting a router the logfile is filled up. See full log -
+ http://paste.ubuntu.com/25429257/
+
+ I can see the error 'Error while deleting router
+ c0dab368-5ac8-4996-88c9-f5d345a774a6' occurred 3343386 times from
+ _safe_router_removed() [1]:
+
+ $ grep -r 'Error while deleting router c0dab368-5ac8-4996-88c9-f5d345a774a6' |wc -l
+ 3343386
+
+ _safe_router_removed() is invoked at L488 [2]. If it fails it returns
+ False, and self._resync_router(update) [3] then causes
+ _safe_router_removed() to be run again and again. That is why we saw
+ so many 'Error while deleting router XXXXX' errors.
+
+ [1] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L361
+ [2] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L488
+ [3] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L457
+
+ [Test Case]
+
+ This is caused by a race condition between neutron server and the L3
+ agent: after neutron server deletes the HA interfaces, the L3 agent may
+ sync an HA router without HA interface info (it is enough to trigger
+ L708 [1] after deleting the HA interfaces and before deleting the HA
+ router). If we delete the HA router at this point, the problem occurs.
+ The test case we designed is as follows:
+
+ 1. Create ha_router
+
+ neutron router-create harouter --ha=True
+
+ 2. Delete ports associated with ha_router before deleting ha_router
+
+ neutron router-port-list harouter |grep 'HA port' |awk '{print $2}' |xargs -l neutron port-delete
+ neutron router-port-list harouter
+
+ 3. Update ha_router to trigger l3-agent to update ha_router info without
+ ha_port into self.router_info
+
+ neutron router-update harouter --description=test
+
+ 4. Delete ha_router this time
+
+ neutron router-delete harouter
+
+ [1] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/db/l3_hamode_db.py#L708
+
+ [Regression Potential]
+
+ The fixed patch [1] will no longer return ha_router which is missing
+ ha_ports, so L488 will no longer have a chance to call
+ _safe_router_removed() for a ha_router, so the problem has been
+ fundamentally fixed by this patch and there is no regression potential.
+
+ Besides, this patch is already in the mitaka-eol branch, and the
+ neutron-server mitaka package is based on neutron-8.4.0, so we need to
+ backport it to xenial and mitaka.
+
+ $ git tag --contains 8c77ee6b20dd38cc0246e854711cb91cffe3a069
+ mitaka-eol
+
+ [1] https://review.openstack.org/#/c/440799/2/neutron/db/l3_hamode_db.py
+ [2] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L488
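
To make the loop described under [Impact] concrete, here is a minimal
sketch in Python. It is not the neutron code: FakeL3Agent, its queue and
the max_iterations cap are hypothetical names introduced for
illustration. Only the method names _safe_router_removed and
_resync_router come from the description above. The sketch assumes that
a failed removal returns False and that the resync simply re-queues the
same update.

```python
from collections import deque


class FakeL3Agent:
    """Toy stand-in for the L3 agent's update loop (hypothetical, not neutron code)."""

    def __init__(self, removal_succeeds):
        self.removal_succeeds = removal_succeeds
        self.queue = deque()      # pending router updates
        self.error_count = 0      # stands in for 'Error while deleting router' log lines

    def _safe_router_removed(self, router_id):
        # On failure, record one error and report False to the caller.
        if self.removal_succeeds:
            return True
        self.error_count += 1
        return False

    def _resync_router(self, router_id):
        # Assumption: resync puts the same update back on the queue.
        self.queue.append(router_id)

    def process_updates(self, max_iterations):
        # max_iterations exists only so this sketch terminates; the loop
        # described in the bug has no such cap.
        iterations = 0
        while self.queue and iterations < max_iterations:
            router_id = self.queue.popleft()
            iterations += 1
            if not self._safe_router_removed(router_id):
                self._resync_router(router_id)
        return iterations


if __name__ == "__main__":
    agent = FakeL3Agent(removal_succeeds=False)
    agent.queue.append("router-1")
    agent.process_updates(max_iterations=1000)
    print(agent.error_count)  # one error per iteration: 1000
```

Because the removal can never succeed once the HA interface info is
gone, every pass produces one more error line and one more resync, which
matches the millions of identical log entries counted by the grep above.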
** Summary changed:

- Infinite loop trying to delete deleted HA router
+ [SRU] Infinite loop trying to delete deleted HA router

** Description changed:

  [Impact]

  When deleting a router the logfile is filled up. See full log -
  http://paste.ubuntu.com/25429257/

  I can see the error 'Error while deleting router
  c0dab368-5ac8-4996-88c9-f5d345a774a6' occurred 3343386 times from
  _safe_router_removed() [1]:

  $ grep -r 'Error while deleting router c0dab368-5ac8-4996-88c9-f5d345a774a6' |wc -l
  3343386

  _safe_router_removed() is invoked at L488 [2]. If it fails it returns
  False, and self._resync_router(update) [3] then causes
  _safe_router_removed() to be run again and again. That is why we saw
  so many 'Error while deleting router XXXXX' errors.

  [1] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L361
  [2] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L488
  [3] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L457

  [Test Case]

  This is caused by a race condition between neutron server and the L3
  agent: after neutron server deletes the HA interfaces, the L3 agent may
  sync an HA router without HA interface info (it is enough to trigger
  L708 [1] after deleting the HA interfaces and before deleting the HA
  router). If we delete the HA router at this point, the problem occurs.
  The test case we designed is as follows:

  1. Create ha_router

  neutron router-create harouter --ha=True

  2. Delete ports associated with ha_router before deleting ha_router

  neutron router-port-list harouter |grep 'HA port' |awk '{print $2}' |xargs -l neutron port-delete
  neutron router-port-list harouter

  3. Update ha_router to trigger l3-agent to update ha_router info without
  ha_port into self.router_info

  neutron router-update harouter --description=test

  4. Delete ha_router this time

  neutron router-delete harouter

  [1] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/db/l3_hamode_db.py#L708

  [Regression Potential]

- The fixed patch [1] will no longer return ha_router which is missing
- ha_ports, so L488 will no longer have a chance to call
+ The fixed patch [1] for neutron-server will no longer return ha_router
+ which is missing ha_ports, so L488 will no longer have a chance to call
  _safe_router_removed() for a ha_router, so the problem has been
  fundamentally fixed by this patch and there is no regression potential.

  Besides, this patch is already in the mitaka-eol branch, and the
  neutron-server mitaka package is based on neutron-8.4.0, so we need to
  backport it to xenial and mitaka.

  $ git tag --contains 8c77ee6b20dd38cc0246e854711cb91cffe3a069
  mitaka-eol

  [1] https://review.openstack.org/#/c/440799/2/neutron/db/l3_hamode_db.py
  [2] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L488

** Tags added: sts sts-sru-needed

** Patch added: "mitaka.debdiff"
   https://bugs.launchpad.net/neutron/+bug/1668410/+attachment/4941145/+files/mitaka.debdiff

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1668410

Title:
  [SRU] Infinite loop trying to delete deleted HA router
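
As a rough illustration of the server-side behaviour described under
[Regression Potential] (neutron-server no longer returning an HA router
that is missing its HA ports), the idea can be sketched in Python as
below. This is not the actual patch: routers_ready_for_agent and the
dict keys "ha" and "ha_interface" are assumptions made for this sketch,
and the real change in l3_hamode_db.py may be structured differently.

```python
def routers_ready_for_agent(routers):
    """Return only the routers that are safe to hand to the L3 agent.

    Each router is a plain dict. The "ha" and "ha_interface" keys are
    illustrative assumptions, not the real neutron data model.
    """
    ready = []
    for router in routers:
        # An HA router without HA interface info is skipped, so the agent
        # never stores a half-built copy of it.
        if router.get("ha") and not router.get("ha_interface"):
            continue
        ready.append(router)
    return ready


if __name__ == "__main__":
    routers = [
        {"id": "r1", "ha": True, "ha_interface": {"id": "port-1"}},
        {"id": "r2", "ha": True, "ha_interface": None},  # HA ports already deleted
        {"id": "r3", "ha": False},
    ]
    print([r["id"] for r in routers_ready_for_agent(routers)])  # ['r1', 'r3']
```

Filtering on the server side means the agent never learns about the
incomplete router in the first place, so the removal path that kept
failing is not reached for it.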
