Public bug reported:

The agent/server communication pattern we use now can lead to cascading
failures that make the servers unavailable.

The current pattern in our communications between the Neutron server and
the agents looks like the following:


Server sends: item <item-uuid> changed
Client receives event.
Client makes a call to the server asking for the item details.


The calls the client makes to the server can be expensive, and a server under
heavy load can take a long time to start processing a request, to fulfill it,
or both. This can trigger a timeout on the agent side, which leads to a retry
or, even worse, a generic fallback that resyncs the entire state. The result is
a thundering herd problem: a server that falls behind on requests is
continually stampeded by retries from agents that timed out before the server
could respond.
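
The retry feedback loop described above can be illustrated with a toy
discrete-time model. This is only a sketch under stated assumptions, not a
measurement of Neutron: the function name, the FIFO queue, the fixed
per-tick capacity and the fixed timeout are all hypothetical simplifications.
The server cannot tell that a queued request was abandoned, so it still
spends capacity answering it.

```python
from collections import deque


def simulate(num_agents, capacity, timeout, ticks):
    """Toy model (hypothetical): return (agents_served, wasted_replies)."""
    queue = deque()    # pending requests as (agent_id, tick_sent)
    latest_sent = {}   # agent_id -> tick of its most recent request
    done = set()       # agents that received a reply they still wanted
    wasted = 0         # replies to requests the agent had already given up on

    # Every agent reacts to the same notification at tick 0.
    for agent in range(num_agents):
        queue.append((agent, 0))
        latest_sent[agent] = 0

    for tick in range(1, ticks + 1):
        # The server answers at most `capacity` requests per tick, in order.
        for _ in range(min(capacity, len(queue))):
            agent, sent = queue.popleft()
            if agent in done or sent != latest_sent[agent]:
                wasted += 1          # stale: the agent timed out and retried
            else:
                done.add(agent)
        # Agents whose latest request has timed out send a new one.
        for agent in range(num_agents):
            if agent not in done and tick - latest_sent[agent] >= timeout:
                queue.append((agent, tick))
                latest_sent[agent] = tick
    return len(done), wasted


if __name__ == "__main__":
    print(simulate(num_agents=10, capacity=1, timeout=3, ticks=30))
    print(simulate(num_agents=10, capacity=1, timeout=100, ticks=30))
```

In this model, once the queue is longer than the timeout, every request the
server reaches is already stale, so the server stays busy without making
progress. With a timeout long enough that nobody retries, the same server
drains the same load.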


The agent/server communication pattern needs to be adjusted so that, at a
minimum, it assumes terrible server response times. Optimally, every
notification generated by the servers should include all of the information an
agent needs to respond to the event, so that the only time an agent actually
has to call the server is on startup, to get its initial state.
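
A minimal sketch of the difference between the two patterns follows. All
class, method and field names here (FakeServer, PullAgent, PushAgent, the
"id" and "item" keys) are hypothetical stand-ins for illustration and are not
Neutron or oslo.messaging APIs. The pull agent calls back for details on
every event; the push agent applies the payload carried in the notification
and only contacts the server on startup.

```python
class FakeServer:
    """Hypothetical stand-in for the server; counts the calls it receives."""

    def __init__(self, items):
        self.items = dict(items)
        self.detail_calls = 0    # per-event "give me the details" calls
        self.startup_calls = 0   # initial-state fetches

    def get_item(self, item_id):
        self.detail_calls += 1
        return self.items[item_id]

    def get_all_items(self):
        self.startup_calls += 1
        return dict(self.items)


class PullAgent:
    """Current pattern: the event carries only an ID, so the agent calls back."""

    def __init__(self, server):
        self.server = server
        self.cache = server.get_all_items()

    def on_item_changed(self, event):
        self.cache[event["id"]] = self.server.get_item(event["id"])


class PushAgent:
    """Proposed pattern: the event carries the full object."""

    def __init__(self, server):
        # The only server call: fetch initial state on startup.
        self.cache = server.get_all_items()

    def on_item_changed(self, event):
        self.cache[event["id"]] = event["item"]


def run(agent_cls, changes, include_payload):
    """Apply changes on the server and deliver one notification per change."""
    server = FakeServer({"port-1": {"status": "DOWN"}})
    agent = agent_cls(server)
    for item_id, item in changes:
        server.items[item_id] = item
        event = {"id": item_id}
        if include_payload:
            event["item"] = item
        agent.on_item_changed(event)
    return agent.cache, server.detail_calls


if __name__ == "__main__":
    changes = [("port-1", {"status": "ACTIVE"}), ("port-2", {"status": "DOWN"})]
    print(run(PullAgent, changes, include_payload=False))
    print(run(PushAgent, changes, include_payload=True))
```

Both agents end up with the same cache, but the push agent gets there without
any per-event calls, so a slow server cannot cause it to time out and retry.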

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1516195

Title:
  Push all object information in AMQP notifications

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1516195/+subscriptions
