Public bug reported:
Summary
=======
OpenStack operators deploying the L3 agent might need to tune the
SYNC_ROUTERS_MIN/MAX_CHUNK_SIZE parameters to avoid flooding the neutron-server.
High level description
======================
The neutron L3 agent and its derivatives (e.g. the neutron-vpn-agent) perform a
full sync when they start. The process is to fetch the list of associated
routers from the neutron-server and then issue a sync_routers RPC call for the
delta between what the agent has online and what it needs to synchronise.
The call time is linearly dependent on the number of routers associated with
that agent and can exceed the RPC timeout if the server is overloaded (as in
the situation of a complete datacenter outage or a multi-step upgrade). The L3
agent will attempt to chunk the call if oslo_messaging.MessagingTimeout is
caught, but by the time it eventually scales down the chunk size the server
may already be swamped with calls and will take considerable time to start
bringing routers online.
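The retry-and-halve behaviour described above can be sketched roughly as
follows. This is a simplified, self-contained illustration, not the actual
neutron code: the function and variable names are illustrative, and the local
MessagingTimeout class stands in for oslo_messaging.MessagingTimeout.

```python
# Simplified sketch of how the L3 agent chunks its sync_routers RPC call
# and halves the chunk size when the server is too slow to answer.
# The constants mirror the hardcoded values in neutron/agent/l3/agent.py.

SYNC_ROUTERS_MAX_CHUNK_SIZE = 256
SYNC_ROUTERS_MIN_CHUNK_SIZE = 32


class MessagingTimeout(Exception):
    """Stand-in for oslo_messaging.MessagingTimeout."""


def sync_all_routers(router_ids, rpc_sync_routers):
    """Fetch router details in chunks, shrinking the chunk size on timeout."""
    chunk_size = min(SYNC_ROUTERS_MAX_CHUNK_SIZE, len(router_ids)) or 1
    synced = []
    i = 0
    while i < len(router_ids):
        chunk = router_ids[i:i + chunk_size]
        try:
            synced.extend(rpc_sync_routers(chunk))
            i += len(chunk)  # this chunk succeeded, move on
        except MessagingTimeout:
            if chunk_size <= SYNC_ROUTERS_MIN_CHUNK_SIZE:
                # Even the smallest chunk times out: give up and let the
                # caller schedule a new full resync.
                raise
            # Retry the same span of routers with a smaller chunk.
            chunk_size = max(chunk_size // 2, SYNC_ROUTERS_MIN_CHUNK_SIZE)
    return synced
```

The point of the sketch is the failure mode we hit: every oversized attempt
still lands on the server and is processed there, so the halving only helps
after the server has already absorbed several expensive calls.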
Pre-conditions
==============
We faced this issue in a production environment and managed to reproduce
approximately the same behaviour in a pre-production environment.
Details of the test environment:
* 4 instances of the neutron-server
- 8 RPC workers
- 8 API workers
* 700 networks with 1 subnetwork each
* 100 tenants
* 9 external networks
* 1 shared network with instances attached to it
* 6 neutron vpn agents (also tested with neutron-l3-agent)
- L3 HA configured
- no l2-population configured
- 240 routers scheduled per agent
- rpc_timeout = 600
* 3 nova-compute nodes
- running 600 instances
- 100 instances with 2 network interfaces
- 50 instances attached to the shared network
Observations:
* the sync_routers RPC call takes 7-10 minutes to get processed
* in production we observed messaging timeouts and chunk scaling after 40 minutes
* in this environment we did not see RPC timeouts, but the sync_routers call
would still exceed an rpc_timeout of 60 seconds and drive the neutron-server to
100% CPU for almost 40 minutes before it eventually scaled down the chunk size
and managed to bring all the routers fully online
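To put the numbers in perspective, here is a back-of-the-envelope check,
assuming the upstream defaults of SYNC_ROUTERS_MAX_CHUNK_SIZE = 256 and
SYNC_ROUTERS_MIN_CHUNK_SIZE = 32 in neutron/agent/l3/agent.py: with 240
routers scheduled per agent, the initial sync requests every router in a
single RPC call, whereas a max chunk size of 32 splits the same work into
several smaller calls.

```python
import math

routers_per_agent = 240      # from the test environment above

# Assumed upstream default vs. the value we patched in.
default_max_chunk = 256
tuned_max_chunk = 32

calls_with_default = math.ceil(routers_per_agent / default_max_chunk)
calls_with_tuned = math.ceil(routers_per_agent / tuned_max_chunk)

print(calls_with_default)  # 1: all 240 routers requested in one RPC call
print(calls_with_tuned)    # 8: the same work spread over smaller calls
```

A single 240-router request is exactly the kind of call that exceeds the RPC
timeout on an overloaded server, while each of the eight smaller requests
completes well within it.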
Modifications:
We modified neutron/agent/l3/agent.py on the L3 agent nodes and set:
SYNC_ROUTERS_MAX_CHUNK_SIZE = 32
SYNC_ROUTERS_MIN_CHUNK_SIZE = 8
... this resulted in the neutron-l3-agent starting to create qrouter-*
namespaces within 10 seconds of a clean restart.
A clean restart for this test means killing all keepalived and neutron agent
processes, deleting the OVS ports, and deleting all namespaces on the node.
This effectively ensures a full clean resync.
Versions tested:
* stable/mitaka (head)
* 8.4.0 tag
* 8.3.0 tag
I checked the code and the logic is the same on master, so I don't expect much
improvement with Newton or Ocata.
I propose making these hardcoded values operator-configurable while keeping
the current defaults. This would not change the behaviour of the code for
anybody except operators who need to adjust these values, and it would spare
us from keeping private patches.
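As a sketch of what that could look like (the option names below are
hypothetical and would be settled in review), an operator would set something
like this in neutron.conf:

```ini
[DEFAULT]
# Hypothetical option names; the defaults would match today's hardcoded values.
sync_routers_max_chunk_size = 32
sync_routers_min_chunk_size = 8
```

Deployments that never touch these options would keep exactly the current
behaviour.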
I have a working patch set which I can submit upstream and which should be
backportable all the way to mitaka.
** Affects: neutron
Importance: Undecided
Status: New
** Tags: neutron neutron-l3-agent neutron-vpn-agent rpc slow sync-routers
** Tags added: neutron-l3-agent
** Tags added: neutron-vpn-agent
** Tags added: rpc slow
** Tags added: sync-routers
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1692971
Title:
neutron operators using L3 agent might need to tune
SYNC_ROUTERS_MAX_CHUNK_SIZE and SYNC_ROUTERS_MIN_CHUNK_SIZE
Status in neutron:
New