Reviewed:  https://review.openstack.org/347708
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bb989be99db84a2789abe2849c786a075e3f5ab7
Submitter: Jenkins
Branch:    master

commit bb989be99db84a2789abe2849c786a075e3f5ab7
Author: John Schwarz <[email protected]>
Date:   Wed Jul 27 12:09:30 2016 +0300

    Don't use exponential back-off for report_state

    If an agent tries to report_state to the neutron-server and it fails
    because of a timeout (raising oslo_messaging.MessagingTimeout), then
    there is an exponential back-off effect, which causes the
    seemingly-simple report_state RPC call to take 60 seconds, then 120,
    then 240 and so on. This can happen if all the controllers are
    restarted simultaneously a number of times, as the bug report
    describes.

    Since the feature was intended for heavy RPC calls (like
    get_routers()) and not for light calls such as report_state, it's
    safe to reduce the timeout to a constant 60-second interval.

    Closes-Bug: #1606827
    Change-Id: I15aeea9f8265b859bb1a8ee933b8b2ce1e64b695

** Changed in: neutron
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1606827

Title:
  Agents might be reported as down for 10 minutes after all controllers
  restart

Status in neutron:
  Fix Released

Bug description:
  The scenario which initially revealed this issue involved multiple
  controllers and an extra compute node (total of 4), but it should also
  reproduce on deployments smaller than described.

  The issue is that if an agent tries to report_state to the
  neutron-server and it fails because of a timeout (raising
  oslo_messaging.MessagingTimeout), then there is an exponential
  back-off effect which was put in place by [1]. The feature was
  intended for heavy RPC calls (like get_routers()) and not for light
  calls such as report_state, so this can be considered a regression.
  This can be reproduced by restarting the controllers on a triple-O
  deployment as specified before.

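The back-off effect described above can be illustrated with a small, self-contained sketch. It is not neutron's code: the function names, the doubling factor, and the optional cap are assumptions chosen only to match the 60, 120, 240 progression quoted in the commit message, and the constant variant stands in for the behaviour the fix aims at for report_state.

```python
from itertools import islice
from typing import Iterator, Optional


def backoff_timeouts(base: int = 60, factor: int = 2,
                     cap: Optional[int] = None) -> Iterator[int]:
    """Yield the timeout (in seconds) used for each consecutive failed call.

    Hypothetical model: every timeout multiplies the next one by `factor`,
    optionally clamped to `cap`. The names and defaults are illustrative.
    """
    timeout = base
    while True:
        yield timeout
        timeout *= factor
        if cap is not None:
            timeout = min(timeout, cap)


def constant_timeouts(base: int = 60) -> Iterator[int]:
    """Yield the same timeout for every call, with no back-off."""
    while True:
        yield base


# Timeouts seen by four consecutive failing report_state calls.
exponential = list(islice(backoff_timeouts(), 4))
constant = list(islice(constant_timeouts(), 4))

print(exponential)  # [60, 120, 240, 480]
print(constant)     # [60, 60, 60, 60]
```

With the exponential variant, a single later call can block for many minutes before the agent gets another chance to report, while the constant variant retries every 60 seconds.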
  A solution would be to ensure PluginReportStateAPI doesn't use the
  exponential backoff, instead seeking to always time out after
  rpc_response_timeout.

  [1]: https://review.openstack.org/#/c/280595/14/neutron/common/rpc.py

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1606827/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

