Public bug reported:

We found an issue where a newly created HA DVR router gets stuck in the
backup state and never transitions to the primary state.
Preconditions:
1) no router is attached to the given external network yet
2) a router goes through a quick creation->deletion cycle; the next router
created afterwards can then get stuck in the backup state

The cause is that the fip-ns is not removed on the agent even though the
floatingip_agent_gateway port was removed.
Below is a demo with which I managed to reproduce this behavior on a
single-node devstack setup.

Create a router and quickly delete it while the l3 agent is still
processing the addition of the external GW:

[root@devstack ~]# r_id=$(openstack router create r1 --distributed --ha -c id -f value); sleep 30 # give time to process
[root@devstack ~]# count_fip_requests() { journalctl -u [email protected] | grep 'FloatingIP agent gateway port received' | wc -l; }
[root@devstack ~]# # add an external gateway and then delete the router while the agent processes gw
[root@devstack ~]# fip_requests=$(count_fip_requests); openstack router set $r_id --external-gateway public; while :; do [[ $fip_requests == $(count_fip_requests) ]] && { echo "waiting before deletion..."; sleep 1; } || break; done; openstack router delete $r_id
waiting before deletion...
waiting before deletion...
[root@devstack ~]#

As a result fip-ns is not deleted even though the
floatingip_agent_gateway port was removed:

[root@devstack ~]# ip netns
fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 1)
[root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
<empty>
[root@devstack ~]#
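
The check above (a fip-ns still present while no floatingip_agent_gateway
port exists on the network) can be sketched in Python. This is only an
illustration, not part of neutron: the function name
stale_fip_namespaces and its arguments are hypothetical, and it assumes
that a fip namespace is named "fip-" followed by the external network
ID, as in the output above.

```python
FIP_NS_PREFIX = "fip-"  # assumed naming: fip-<external network id>


def stale_fip_namespaces(ip_netns_output, networks_with_gateway_port):
    """Return fip namespaces whose network has no agent gateway port.

    ip_netns_output: text printed by `ip netns`.
    networks_with_gateway_port: set of network IDs that still have a
    floatingip_agent_gateway port (hypothetical input, e.g. collected
    from `openstack port list`).
    """
    stale = []
    for line in ip_netns_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0]  # ignore the optional "(id: N)" suffix
        if not name.startswith(FIP_NS_PREFIX):
            continue
        network_id = name[len(FIP_NS_PREFIX):]
        if network_id not in networks_with_gateway_port:
            stale.append(name)
    return stale


netns_output = "fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 1)\n"
# The port list was empty, so no network has a gateway port.
print(stale_fip_namespaces(netns_output, set()))
```

With the state shown above, the leftover namespace is reported as stale;
passing the network ID in the set makes the result empty.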

Now re-create the router together with the external gw:

[root@devstack ~]# openstack router create r1 --ha --distributed --external-gateway public

The logs show a traceback indicating that the creation of this router
initially failed, followed by a successful creation:

ERROR neutron.agent.l3.dvr_fip_ns Traceback (most recent call last):
ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 152, in create_or_update_gateway_port
ERROR neutron.agent.l3.dvr_fip_ns     self._update_gateway_port(
ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 323, in _update_gateway_port
ERROR neutron.agent.l3.dvr_fip_ns     self.driver.set_onlink_routes(
ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 193, in set_onlink_routes
ERROR neutron.agent.l3.dvr_fip_ns     onlink = device.route.list_onlink_routes(constants.IP_VERSION_4)
ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 633, in list_onlink_routes
ERROR neutron.agent.l3.dvr_fip_ns     routes = self.list_routes(ip_version, scope='link')
ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 629, in list_routes
ERROR neutron.agent.l3.dvr_fip_ns     return list_ip_routes(self._parent.namespace, ip_version, scope=scope,
ERROR neutron.agent.l3.dvr_fip_ns   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 1585, in list_ip_routes
ERROR neutron.agent.l3.dvr_fip_ns     routes = privileged.list_ip_routes(namespace, ip_version, device=device,
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in wrapped_f
ERROR neutron.agent.l3.dvr_fip_ns     return self(f, *args, **kw)
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in __call__
ERROR neutron.agent.l3.dvr_fip_ns     do = self.iter(retry_state=retry_state)
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 360, in iter
ERROR neutron.agent.l3.dvr_fip_ns     return fut.result()
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result
ERROR neutron.agent.l3.dvr_fip_ns     return self.__get_result()
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
ERROR neutron.agent.l3.dvr_fip_ns     raise self._exception
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in __call__
ERROR neutron.agent.l3.dvr_fip_ns     result = fn(*args, **kwargs)
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap
ERROR neutron.agent.l3.dvr_fip_ns     return self.channel.remote_call(name, args, kwargs,
ERROR neutron.agent.l3.dvr_fip_ns   File "/usr/local/lib/python3.9/site-packages/oslo_privsep/daemon.py", line 215, in remote_call
ERROR neutron.agent.l3.dvr_fip_ns     raise exc_type(*result[2])
ERROR neutron.agent.l3.dvr_fip_ns neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound: Network interface fg-b489f216-35not found in namespace fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f.


The result is the following state:

[root@devstack ~]# ip netns
fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 2)
qrouter-1f384e52-533c-49ed-b809-71f6358a2e5b
snat-1f384e52-533c-49ed-b809-71f6358a2e5b (id: 1)
[root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
+--------------------------------------+----------------------------------+--------+
| ID                                   | Device Owner                     | Status |
+--------------------------------------+----------------------------------+--------+
| 17679644-d775-4182-b5b3-f2035e6483d9 | network:router_gateway           | DOWN   |
| b489f216-356a-456a-82ab-849e43a3226d | network:floatingip_agent_gateway | ACTIVE |
+--------------------------------------+----------------------------------+--------+
[root@devstack ~]#
[root@devstack ~]# cat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/state
backup
[root@devstack ~]# stat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/neutron-keepalived-state-change.log
...
Access: 2023-01-19 11:10:10.715245690 -0500
Modify: 2023-01-19 11:10:18.976208238 -0500
Change: 2023-01-19 11:10:18.976208238 -0500
 Birth: 2023-01-19 11:10:10.715245690 -0500
[root@devstack ~]# stat /var/run/netns/fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f
...
Access: 2023-01-19 11:10:19.533205713 -0500
Modify: 2023-01-19 11:10:19.533205713 -0500
Change: 2023-01-19 11:10:19.533205713 -0500
 Birth: -
[root@devstack ~]#

The timestamps show that the keepalived state monitoring started
(11:10:10) before the fip-ns was re-created (11:10:19) following the
unsuccessful first attempt to create the router.
So the keepalived monitoring is still bound to the stale fip-ns that was
left over from the previously deleted router.
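
The ordering can be double-checked from the two stat outputs above. The
following Python sketch is only an illustration: parse_stat_time is a
hypothetical helper, and it assumes the Birth time of
neutron-keepalived-state-change.log marks the start of the monitor and
the Modify time of the netns file marks the re-creation of the fip-ns.

```python
from datetime import datetime


def parse_stat_time(text):
    """Parse a stat timestamp like '2023-01-19 11:10:10.715245690 -0500'."""
    date, clock, offset = text.split()
    whole, _, frac = clock.partition(".")
    # strptime's %f accepts at most 6 digits, so drop the nanosecond tail.
    frac = (frac or "0")[:6]
    return datetime.strptime(
        f"{date} {whole}.{frac} {offset}", "%Y-%m-%d %H:%M:%S.%f %z")


# Values copied from the stat outputs above.
monitor_started = parse_stat_time("2023-01-19 11:10:10.715245690 -0500")
fip_ns_recreated = parse_stat_time("2023-01-19 11:10:19.533205713 -0500")

gap = (fip_ns_recreated - monitor_started).total_seconds()
print(f"fip-ns re-created {gap:.3f}s after the monitor started")
```

A positive gap means the monitor was already running when the fip-ns was
re-created.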

Adding an external gw and then removing the router is a race, so the
issue does not reproduce every time. To reproduce it reliably, add a
small sleep with the following patch:

[root@devstack neutron]# git diff
diff --git a/neutron/agent/l3/dvr_local_router.py b/neutron/agent/l3/dvr_local_router.py
index 6e37c09511..d01eb0de9b 100644
--- a/neutron/agent/l3/dvr_local_router.py
+++ b/neutron/agent/l3/dvr_local_router.py
@@ -837,6 +837,8 @@ class DvrLocalRouter(dvr_router_base.DvrRouterBase):
                 self.agent.context, ex_gw_port['network_id'])
             LOG.debug("FloatingIP agent gateway port received from the "
                       "plugin: %s", fip_agent_port)
+            import time
+            time.sleep(5)
         self.fip_ns.create_or_update_gateway_port(fip_agent_port)

     def update_routing_table(self, operation, route):
[root@devstack neutron]#

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2003359

Title:
  DVR HA router gets stuck in backup state

Status in neutron:
  New
