Reviewed:  https://review.opendev.org/703612
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=18410097f23a8e3d9cf33393b47d8b1a91020e4a
Submitter: Zuul
Branch:    master

commit 18410097f23a8e3d9cf33393b47d8b1a91020e4a
Author: Daniel Alvarez <[email protected]>
Date:   Tue Jan 21 14:26:22 2020 +0100

    [ovn] Agent liveness - allow time to propagate checks

    Right now neutron-server bumps the nb_cfg parameter in the NB_Global
    table. The new value needs to be propagated by northd to SB_Global,
    processed by the agents, and written back into SB_Global. The result
    then has to be processed by neutron-server, but unfortunately the
    server checks straight away, and the value it reads is often behind
    the expected value. This results in frequent false positives, with
    agents shown as dead when they are not.

    This patch relaxes the checks by allowing a difference of 1 between
    the read and expected values.

    Change-Id: Id91481b690ad569c5dcfa5bd404f497f591d729d
    Closes-Bug: 1860436
    Signed-off-by: Daniel Alvarez <[email protected]>

** Changed in: neutron
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1860436

Title:
  [ovn] Agent liveness checks are flaky and report false positives

Status in neutron:
  Fix Released

Bug description:
  The way the networking-ovn mech driver performs health checks on
  agents reports false positives due to race conditions:

  1) neutron-server increments nb_cfg in the NB_Global table from X to X+1
  2) neutron-server almost immediately checks all the Chassis rows to see
     if they have written (X+1). [1]
  3) neutron-server processes the updates from each agent from X to X+1

  *Most* of the time, the condition in step 2 doesn't hold, so the
  timestamp is not updated. The result is that after 60 seconds (the
  default agent timeout), the agent is shown as dead.

  Sometimes 3) happens before 2), so the timestamp gets updated and all
  is fine, but this is not the normal case:

  1) Bump of nb_cfg

  2020-01-21 11:35:59.534 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36915
  2020-01-21 11:35:59.538 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36916

  2) Check of each chassis ext_id against our new bumped nb_cfg:

  2020-01-21 11:35:59.539 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
  2020-01-21 11:35:59.540 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
  2020-01-21 11:35:59.541 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
  2020-01-21 11:35:59.542 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
  2020-01-21 11:35:59.543 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
  2020-01-21 11:35:59.544 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
  2020-01-21 11:35:59.546 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915

  3) Processing updates [2] in the ChassisEvent (some are even older!)

  2020-01-21 11:35:59.546 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36915
  2020-01-21 11:35:59.548 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36915
  2020-01-21 11:35:59.556 32 INFO networking_ovn.ovsdb.ovsdb_monitor [req-efa34cac-2296-4d30-b153-9630b0309fcd - - - - -] XXX chassis update:
  2020-01-21 11:35:59.556 27 INFO networking_ovn.ovsdb.ovsdb_monitor [req-91f7d181-bfa3-4646-9814-bb680d011081 - - - - -] XXX chassis update:
  2020-01-21 11:35:59.557 25 INFO networking_ovn.ovsdb.ovsdb_monitor [req-420e5a25-13e4-4da6-8277-8a3a1028c9e9 - - - - -] XXX chassis update:
  2020-01-21 11:35:59.756 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36916
  2020-01-21 11:35:59.778 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36916

  IMO, we need to space the bump of nb_cfg [2] and the check [3] in
  time, as the NB_Global change needs to be propagated to the SB,
  processed by all agents, and then sent back to neutron-server, which
  needs to process the JSON stuff and update the internal tables. So
  even if it's fast, most of the time it is not fast enough.

  Another solution is to allow a difference of '1' to update timestamps.

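  To illustrate the second option, here is a minimal sketch of a relaxed
  comparison. It is not the networking-ovn code: the function name
  chassis_is_current and the tolerance parameter are hypothetical, and
  rejecting a chassis value that is ahead of the global one is an
  assumption of this sketch. The sample values come from the logs above.

```python
def chassis_is_current(global_nb_cfg, chassis_nb_cfg, tolerance=1):
    """Return True if the chassis value is at most `tolerance` behind.

    Hypothetical helper: a strict check (tolerance=0) only accepts an
    exact match, which is what produces the false positives described
    above. A tolerance of 1 accepts a chassis that has not yet reported
    the latest bump.
    """
    lag = global_nb_cfg - chassis_nb_cfg
    return 0 <= lag <= tolerance


# Values taken from the "YYY" log lines: the global value was just bumped.
strict = chassis_is_current(36916, 36915, tolerance=0)
relaxed = chassis_is_current(36916, 36915, tolerance=1)
print("strict check passes:", strict)
print("relaxed check passes:", relaxed)
```

  With the relaxed check the timestamp would be refreshed even when the
  agent's reply to the latest bump has not been processed yet, while an
  agent that has fallen two or more bumps behind would still fail it.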
  [0] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1093
  [1] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1098
  [2] https://github.com/openstack/networking-ovn/blob/bf577e5a999f7db4cb9b790664ad596e1926d9a0/networking_ovn/ml2/mech_driver.py#L988
  [3] https://github.com/openstack/networking-ovn/blob/6302298e9c4313f1200c543c89d92629daff9e89/networking_ovn/ovsdb/ovsdb_monitor.py#L74

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1860436/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

