Public bug reported:

The scenario that initially revealed this issue involved multiple
controllers and an extra compute node (4 nodes in total), but it should
also reproduce on smaller deployments.

The issue is that if an agent's report_state call to the neutron-server
fails because of a timeout (raising oslo_messaging.MessagingTimeout),
an exponential back-off takes effect, which was put in place by [1].
The feature was intended for heavy RPC calls (like get_routers()) and
not for light calls such as report_state, so this can be considered a
regression.

This can be reproduced by restarting the controllers on a TripleO
deployment as described above.

A solution would be to ensure PluginReportStateAPI doesn't use the
exponential back-off, so that it always times out after
rpc_response_timeout.

[1]: https://review.openstack.org/#/c/280595/14/neutron/common/rpc.py

** Affects: neutron
     Importance: Undecided
     Assignee: John Schwarz (jschwarz)
         Status: In Progress

** Tags: liberty-backport-potential mitaka-backport-potential

** Description changed:

- report_state, so this can be considered a regression.
+ report_state, so this can be considered a regression. This can be
+ reproduced by restarting the controllers on a triple-O deployment and
+ specified before.

** Tags added: mitaka-backport-potential

** Tags added: liberty-backport-potential

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1606827

Title:
  Agents might be reported as down for 10 minutes after all controllers
  restart

Status in neutron:
  In Progress
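
The back-off effect described in this report can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the code from neutron's rpc.py: it assumes the timeout doubles after every MessagingTimeout, that rpc_response_timeout is 60 seconds, and that the back-off is capped at ten times that value (a cap picked here because it would be consistent with the 10 minutes in the bug title). The function names backoff_timeouts and fixed_timeouts are hypothetical.

```python
def backoff_timeouts(base_timeout, max_timeout, failures):
    """Timeout used for each of `failures` consecutive timed-out calls.

    Assumption: the timeout doubles after every failure, up to max_timeout.
    """
    timeouts = []
    current = base_timeout
    for _ in range(failures):
        timeouts.append(current)
        current = min(current * 2, max_timeout)
    return timeouts


def fixed_timeouts(base_timeout, failures):
    """Timeouts when every call simply uses the base timeout (the proposed fix)."""
    return [base_timeout] * failures


# Assumed values, for illustration only.
RPC_RESPONSE_TIMEOUT = 60                 # seconds
MAX_TIMEOUT = 10 * RPC_RESPONSE_TIMEOUT   # hypothetical cap on the back-off

with_backoff = backoff_timeouts(RPC_RESPONSE_TIMEOUT, MAX_TIMEOUT, 6)
without_backoff = fixed_timeouts(RPC_RESPONSE_TIMEOUT, 6)

print("with back-off:   ", with_backoff)     # [60, 120, 240, 480, 600, 600]
print("without back-off:", without_backoff)  # [60, 60, 60, 60, 60, 60]
```

Under these assumptions, a few consecutive timeouts while the controllers restart are enough for a single report_state call to block for up to 600 seconds, whereas with a fixed timeout the agent retries every 60 seconds and can report in soon after the server is reachable again.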