Public bug reported:

Summary
=======
OpenStack operators deploying the L3 agent might need to tune the
SYNC_ROUTERS_MIN/MAX_CHUNK_SIZE parameters to avoid flooding the
neutron-server.

High level description
======================
The neutron L3 agent and its derivatives (e.g. neutron-vpn-agent) perform a
full sync when they start. The process is to fetch the list of associated
routers from the neutron-server and then issue a sync_routers RPC call for the
delta between what is already online and what still needs to be synchronised.
The call time depends linearly on the number of routers scheduled to the agent
and can hit the RPC timeout if the server is overloaded (for example, during a
complete datacenter outage or a multi-step upgrade). The L3 agent will split
the call into smaller chunks when oslo_messaging.MessagingTimeout is caught,
but by the time it eventually scales the chunk size down, the server may
already be swamped with calls and can take a considerable time to start
bringing routers online.
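
The chunk-scaling behaviour described above can be sketched as follows; the
function and the simulated RPC callback are illustrative, not the exact
neutron code, and the 256/32 defaults are what I recall from
neutron/agent/l3/agent.py in the versions tested:

```python
SYNC_ROUTERS_MAX_CHUNK_SIZE = 256  # upstream defaults as recalled from the
SYNC_ROUTERS_MIN_CHUNK_SIZE = 32   # versions tested; treat as assumptions


class MessagingTimeout(Exception):
    """Stand-in for oslo_messaging.MessagingTimeout."""


def fetch_routers(router_ids, rpc_call,
                  max_chunk=SYNC_ROUTERS_MAX_CHUNK_SIZE,
                  min_chunk=SYNC_ROUTERS_MIN_CHUNK_SIZE):
    """Fetch router details in chunks, halving the chunk size on timeout."""
    routers = []
    size = max_chunk
    i = 0
    while i < len(router_ids):
        try:
            routers.extend(rpc_call(router_ids[i:i + size]))
            i += size
        except MessagingTimeout:
            if size <= min_chunk:
                raise  # already at the floor: give up and let the caller retry
            size = max(size // 2, min_chunk)
    return routers, size


# Simulate an overloaded server that times out on any chunk larger than 8,
# with an agent tuned to max_chunk=32 / min_chunk=8 and 240 scheduled routers
# (matching the test environment below).
def overloaded_rpc(chunk):
    if len(chunk) > 8:
        raise MessagingTimeout()
    return [{"id": rid} for rid in chunk]


routers, final_size = fetch_routers(list(range(240)), overloaded_rpc,
                                    max_chunk=32, min_chunk=8)
```

With these tuned bounds the agent only wastes two timed-out calls (32, then
16) before settling at a chunk size the server can handle, instead of
repeatedly halving down from a much larger starting size.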

Pre-conditions
==============
We hit this issue in a production environment and managed to approximately
reproduce the behaviour in a pre-production environment.

Details of the test environment:
* 4 instances of the neutron-server
 - 8 RPC workers
 - 8 API workers
 - 700 networks with 1 subnetwork each
 - 100 tenants
 - 9 external networks
 - 1 shared network with instances attached to it
* 6 neutron vpn agents (also tested with neutron-l3-agent)
 - L3 HA configured
 - no l2-population configured
 - 240 routers scheduled per agent
 - rpc_timeout = 600
* 3 nova-compute nodes
 - running 600 instances
 - 100 instances with 2 network interfaces
 - 50 instances attached to the shared network

Observations:
* the sync_routers RPC call takes 7-10 minutes to process
* in production we observe a messaging timeout and chunk scaling after 40 minutes
* in this environment we don't see an RPC timeout, but the sync_routers call
still exceeds the default rpc_timeout of 60 seconds and drives neutron-server
to 100% CPU for almost 40 minutes before the chunk size eventually scales down
and all the routers come fully online

Modifications:
We modified neutron/agent/l3/agent.py on the L3 agent nodes and set:
SYNC_ROUTERS_MAX_CHUNK_SIZE = 32
SYNC_ROUTERS_MIN_CHUNK_SIZE = 8
... this resulted in the neutron-l3-agent starting to create qrouter-* 
namespaces within 10 seconds of a clean restart.
Clean restart for this test is to kill all keepalived and neutron agent 
processes, delete ovs ports and delete all namespaces from the node. This 
effectively ensures a full clean resync.
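
For reference, the modification amounts to overriding the two module-level
constants; the "was" values below are the upstream defaults as I recall them
from the versions tested, so treat them as assumptions:

```python
# neutron/agent/l3/agent.py, patched on the agent nodes
SYNC_ROUTERS_MAX_CHUNK_SIZE = 32  # was 256: cap on routers per sync_routers RPC call
SYNC_ROUTERS_MIN_CHUNK_SIZE = 8   # was 32: floor when halving on MessagingTimeout
```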

Versions tested:
* stable/mitaka (head)
* 8.4.0 tag
* 8.3.0 tag
I checked the code and the logic is the same on master, so I don't expect much 
improvement with Newton or Ocata.

I want to propose that we make these hardcoded values operator-configurable 
while keeping the current defaults. This would not change the behaviour of the 
code for anybody except operators who need to adjust these values, and it 
would spare us from carrying private patches.
I have a working patch set I can submit upstream, which should be backportable 
all the way to mitaka.
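
As a rough sketch, the proposal could expose the two values as oslo.config
options; the option names and defaults here are hypothetical, not the actual
patch set:

```python
# Hypothetical config options for the proposal; names and defaults are
# illustrative only, not the actual upstream patch.
from oslo_config import cfg

OPTS = [
    cfg.IntOpt('sync_routers_max_chunk_size',
               default=256, min=1,
               help='Upper bound on the number of routers requested per '
                    'sync_routers RPC call during a full sync.'),
    cfg.IntOpt('sync_routers_min_chunk_size',
               default=32, min=1,
               help='Lower bound the chunk size is halved down to after '
                    'an oslo_messaging.MessagingTimeout.'),
]


def register_opts(conf):
    """Register the sync_routers chunking options on a ConfigOpts object."""
    conf.register_opts(OPTS)
```

Keeping the defaults equal to today's hardcoded values means behaviour is
unchanged unless an operator explicitly tunes them.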

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: neutron neutron-l3-agent neutron-vpn-agent rpc slow sync-routers


-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1692971

Title:
  neutron operators using L3 agent might need to tune
  SYNC_ROUTERS_MAX_CHUNK_SIZE and SYNC_ROUTERS_MIN_CHUNK_SIZE

Status in neutron:
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1692971/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
