Public bug reported:

The L3 agent's _process_routers_loop method spawns a GreenPool with 8
eventlet green threads. Those threads take updates off the agent's queue
and process the corresponding routers. Router updates are serialized by
router_id so that no two threads process the same router at any given
time.
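
Below is a minimal sketch of that pattern (not neutron's actual code),
just to make the mechanism concrete: a GreenPool of workers pulling
router IDs off a shared queue, with a per-router semaphore standing in
for the per-router_id serialization:

    import eventlet
    eventlet.monkey_patch()

    from collections import defaultdict

    from eventlet.queue import Queue
    from eventlet.semaphore import Semaphore

    POOL_SIZE = 8  # the default discussed in this report

    updates = Queue()
    router_locks = defaultdict(Semaphore)  # one lock per router_id


    def process_router(router_id):
        # Stand-in for the real work: running ip, keepalived, etc. for
        # one router.
        eventlet.sleep(0.1)


    def worker():
        while True:
            router_id = updates.get()
            with router_locks[router_id]:  # serialize work per router
                process_router(router_id)
            updates.task_done()


    pool = eventlet.GreenPool(size=POOL_SIZE)
    for _ in range(POOL_SIZE):
        pool.spawn_n(worker)

    for rid in range(600):  # roughly the number of routers synced here
        updates.put(rid)
    updates.join()  # wait until every queued update has been processed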

In an environment running on a powerful baremetal server, the agent was
trying to sync roughly 600 routers on restart. Around half were HA
routers, and half were legacy routers. With the default GreenPool size
of 8, the server ground to a halt as CPU usage skyrocketed to over
600%; the main offenders were ip, bash, keepalived and Python. This was
an environment based on stable/juno without the rootwrap daemon. It
took around 60 seconds to configure a single router. Changing the
GreenPool size from 8 to 1 (see the sketch after this list) caused the
agent to:

1) Configure a router in 30 seconds, a 50% improvement.
2) Reduce CPU load from 600% to 70%, freeing the machine to do other things.
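
For reference, the experiment above only changed the hardcoded pool
size. A hypothetical sketch of making it configurable instead (the
option name router_processing_pool_size and the loop body below are
illustrative, not existing neutron code):

    import eventlet
    from oslo_config import cfg

    # Hypothetical option: 'router_processing_pool_size' does not exist
    # in neutron; it is shown only to illustrate making the size tunable.
    OPTS = [
        cfg.IntOpt('router_processing_pool_size',
                   default=8,
                   help='Number of eventlet green threads used by the L3 '
                        'agent to process router updates.'),
    ]
    cfg.CONF.register_opts(OPTS)


    def _process_routers_loop(process_router_update):
        # process_router_update stands in for the agent's per-update
        # worker. spawn_n blocks once the pool is full, so the pool size
        # caps how many routers are being configured concurrently.
        pool = eventlet.GreenPool(size=cfg.CONF.router_processing_pool_size)
        while True:
            pool.spawn_n(process_router_update)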

I'm filing this bug so that:

1) Someone can confirm my experience in a more controlled way - for
example, by graphing router configuration time and CPU load as a
function of GreenPool size (a rough measurement sketch follows below).
2) If my findings are confirmed on master with the rootwrap daemon, we
can start considering alternatives like multiprocessing instead of
eventlet multithreading, or at the very least tune the GreenPool size.
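
For (1), a rough sketch of the kind of measurement I have in mind;
resync_done is a placeholder you would supply (for example, by watching
the agent log), and psutil is just one convenient CPU sampler:

    import time

    import psutil


    def measure_resync(resync_done, sample_interval=5):
        # Time a full resync and sample system-wide CPU usage while it
        # runs. Note that psutil.cpu_percent() returns 0-100% regardless
        # of core count, unlike top's per-core figures quoted above.
        start = time.time()
        cpu_samples = []
        while not resync_done():
            cpu_samples.append(psutil.cpu_percent(interval=sample_interval))
        elapsed = time.time() - start
        avg_cpu = sum(cpu_samples) / max(len(cpu_samples), 1)
        return elapsed, avg_cpu

Repeating this with different pool sizes would give the configuration
time vs. CPU load graph mentioned above.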

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: l3-ipam-dhcp loadimpact

https://bugs.launchpad.net/bugs/1526559

Title:
  L3 agent parallel configuration of routers might slow things down

Status in neutron:
  New

