Recently we encountered some ovs-agent crash issues. [1][2][3]

*[Root cause]*

1. Currently only a 'restarted' flag is used in rpc_loop() to identify ovs status.

     ovs_restarted = self.check_ovs_restart()

   *True*: ovs is running, but a restart happened before this loop. rpc_loop() resets bridges and re-processes ports.
   *False*: ovs has been running since the last loop. rpc_loop() continues to process in the normal way.

   But if ovs is dead, or is not up yet during a restart, check_ovs_restart() incorrectly returns "True". rpc_loop() then goes on to reset bridges and apply other ovs operations, until it hits exceptions or crashes.
   Related Bug: [1] [2]

2. Also, during agent boot up, ovs status is not checked at all. When ovs is dead, the agent crashes without any useful log info.
   Related Bug: [3]

*[Proposal]*

1. Add constants {NORMAL, DEAD, RESTARTED} to represent ovs status.
   NORMAL - ovs has been running since the last loop. rpc_loop() continues to process in the normal way.
   RESTARTED - ovs is running, but a restart happened before this loop. rpc_loop() resets bridges and re-processes ports.
   DEAD - keep the agent running, but rpc_loop() doesn't apply ovs operations, to prevent unnecessary exceptions/crashes. When ovs comes back up, the agent enters RESTARTED mode.

2. Check ovs status during agent boot up. If it's DEAD, exit gracefully, since subsequent operations would cause a crash, and write a log message noting that a dead ovs caused the agent to terminate.

*[Code Review]*

https://review.openstack.org/#/c/110538/

I would appreciate it if you could share some thoughts or do a quick code review.

Thanks.

Best,
Robin

[1] https://bugs.launchpad.net/neutron/+bug/1296202
[2] https://bugs.launchpad.net/neutron/+bug/1350179
[3] https://bugs.launchpad.net/neutron/+bug/1351135
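
P.S. To make the proposal above a bit more concrete, here is a rough standalone sketch of the three-state handling. It is not the code under review. OvsStatus, check_ovs_status(), loop_iteration(), boot_check() and the two boolean probes are names made up for illustration, and the "marker" is just an assumed stand-in for any state the agent installed earlier that disappears when ovs restarts.

```python
import enum


class OvsStatus(enum.Enum):
    # Hypothetical constants mirroring {NORMAL, DEAD, RESTARTED}.
    NORMAL = "normal"
    DEAD = "dead"
    RESTARTED = "restarted"


def check_ovs_status(ovs_alive, marker_present):
    """Classify ovs state from two assumed probes.

    ovs_alive: whether ovs responds at all.
    marker_present: whether state installed by the agent earlier is
    still there (assumed to be lost when ovs restarts).
    """
    if not ovs_alive:
        return OvsStatus.DEAD
    if not marker_present:
        return OvsStatus.RESTARTED
    return OvsStatus.NORMAL


def loop_iteration(status):
    """Return the actions one rpc_loop() pass would take for a status."""
    if status is OvsStatus.DEAD:
        # Keep the agent running, but apply no ovs operations.
        return ["skip"]
    if status is OvsStatus.RESTARTED:
        return ["reset_bridges", "reprocess_all_ports"]
    return ["process_updated_ports"]


def boot_check(status):
    """Return a process exit code at agent start-up (non-zero if DEAD)."""
    if status is OvsStatus.DEAD:
        print("ovs is dead, terminating agent")
        return 1
    return 0
```

With this split, a dead ovs that later comes back is seen as alive with the marker missing, so the next pass is classified RESTARTED and the bridges are reset only once ovs is really up.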
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev