Public bug reported:

For a couple of weeks we have had a problem in our production environment when restarting our l3-agent. (Our assumption is that this might have something to do with our upgrade to Wallaby, as we never saw this problem on prior releases.)

The l3 agent is hosting around 300 HA routers, so restarting the agent takes a couple of seconds. During that time the agent's alive state goes down, and therefore all active routers that were hosted on that agent flip to standby state. When the agent has finished its startup it should set the correct active state for its routers again, but it fails to do so for a random number of routers. It does not log any exceptions or errors, so we started to debug this problem in our lab environment, which has at most 10-20 routers.

To reproduce this we stopped an l3-agent completely until its alive state was down and the routers flipped into standby. After starting the agent again, some routers do not get back into active state, just as in production.

We dug quite deep into the code, and what we see for routers that are not functioning correctly is that they only get into the _process_added_router function [1] and never go into the _process_updated_router function [2]. All other routers, the ones that work, first hit [1] and then a couple of seconds later go into [2], which then sets the correct state again.

What is quite confusing is that different routers are affected on each stop/start sequence of the l3-agent, and restarting an agent sometimes fixes this and sometimes does not.

At this point we are not really sure how to debug this further, as we are not familiar with how and where update events originate. Does anyone have an idea where this could be broken, or can anyone point us in a direction for debugging this further?

Neutron is running on Wallaby (18.5.0).

Thanks in advance

[1] https://github.com/openstack/neutron/blob/18.5.0/neutron/agent/l3/agent.py#L631
[2] https://github.com/openstack/neutron/blob/18.5.0/neutron/agent/l3/agent.py#L633

** Affects: neutron
     Importance: Undecided
         Status: New

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2009043

Title:
  neutron-l3-agent restart some random ha routers get wrong state

Status in neutron:
  New
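Because different routers are affected on each stop/start cycle described in the report, it may help to record the HA state of every router after each restart and compare the runs. The following is only a minimal sketch of such a check, not part of Neutron: the function name and the (router_id, agent_host, ha_state) tuple layout are assumptions made for illustration, and the records would have to be collected separately (for example from the HA state that something like `openstack network agent list --router <router> --long` reports per hosting agent).

```python
from collections import defaultdict


def find_routers_without_active(records):
    """Return the sorted IDs of routers that have no 'active' instance.

    `records` is an iterable of (router_id, agent_host, ha_state) tuples.
    This layout is a hypothetical one chosen for this sketch; it is not
    a Neutron data structure.
    """
    states_by_router = defaultdict(list)
    for router_id, _agent_host, ha_state in records:
        states_by_router[router_id].append(ha_state)

    # A healthy HA router should be active on exactly one agent; here we
    # only flag the case from the report: no agent reports it as active.
    return sorted(
        router_id
        for router_id, states in states_by_router.items()
        if "active" not in states
    )


if __name__ == "__main__":
    # Made-up sample data, only to show the expected input shape.
    sample = [
        ("router-a", "net-node-1", "active"),
        ("router-a", "net-node-2", "standby"),
        ("router-b", "net-node-1", "standby"),
        ("router-b", "net-node-2", "standby"),
    ]
    print(find_routers_without_active(sample))
```

Running this after every restart and keeping the output would show whether the set of stuck routers is really random or correlates with something such as processing order.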

