[Yahoo-eng-team] [Bug 1785582] [NEW] Connectivity to instance after L3 router migration from Legacy to HA fails

Slawek Kaplonski Mon, 06 Aug 2018 02:43:51 -0700

Public bug reported:

The scenario test
neutron.tests.tempest.scenario.test_migration.NetworkMigrationFromLegacy.test_from_legacy_to_ha
fails because there is no connectivity to the VM after the migration.
We observed it mostly on Pike, but I think the same issue may also be present
in newer versions.


Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_migration.py", line 68, in test_from_legacy_to_ha
    after_dvr=False, after_ha=True)
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_migration.py", line 55, in _test_migration
    self._check_connectivity()
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_dvr.py", line 29, in _check_connectivity
    self.keypair['private_key'])
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/base.py", line 204, in check_connectivity
    ssh_client.test_connection_auth()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 207, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 10.0.0.224 via SSH timed out.
User: cirros, Password: None


From my investigation, it looks like this is caused by a race between two
different operations on the router.

1. The router is switched to admin_state down, so its port is also set to DOWN.
2. neutron-server gets the information from the OVS agent that the port is down.
3. Meanwhile, another thread changes the router from legacy to HA, so the owner
   of this port changes from DEVICE_OWNER_ROUTER_INTF to
   DEVICE_OWNER_HA_REPLICATED_INT. The router is also still "on" this host (as
   it is now a backup node for the router), so in
   https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/l2pop/mech_driver.py#L258
   l2pop decides not to send remove_fdb_entries for this MAC address on this
   port, and the old entries stay on the other nodes. Later, when this port
   comes up on a different host (the new master node), add_fdb_entries is not
   sent to the hosts either, because of
   https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/l2pop/mech_driver.py#L307
   which was added in
   https://github.com/openstack/neutron/commit/26d8702b9d7cc5a4293b97bc435fa85983be9f01

I ran this test with an added wait until the router's port is really DOWN
before calling the migration to HA, and then it passed 151 times for me. This
clearly shows that the race is the issue here.
I think it should be fixed in neutron's code rather than in the test, as this
isn't a test-only issue.
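
The wait I used while investigating can be sketched roughly as below. This is
only an illustration of the idea (poll until every router port reports DOWN,
then migrate), not the code from the test suite. The names wait_until,
wait_for_ports_down and list_port_statuses are hypothetical;
list_port_statuses stands for any callable that returns the current status
strings of the router's ports, for example one that wraps a port list call
filtered by the router's device_id.

```python
import time


def wait_until(predicate, timeout=60.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or `timeout` seconds elapse.

    Returns True on success and False on timeout. `sleep` and `clock` are
    injectable so the helper can be exercised without real delays.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def wait_for_ports_down(list_port_statuses, timeout=60.0, interval=1.0,
                        **kwargs):
    """Block until every status from list_port_statuses() is 'DOWN'.

    list_port_statuses is a hypothetical callable (an assumption of this
    sketch) returning a list of status strings such as ['ACTIVE', 'DOWN'].
    Raises TimeoutError if the ports do not all reach DOWN in time.
    """
    def all_down():
        return all(status == "DOWN" for status in list_port_statuses())

    if not wait_until(all_down, timeout, interval, **kwargs):
        raise TimeoutError(
            "router ports did not reach DOWN within %s seconds" % timeout)
```

In a test, the call would sit between setting the router to admin_state down
and updating it to ha=True. It only hides the race in the test, which is why
I think the real fix belongs in neutron.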

** Affects: neutron
     Importance: Medium
     Assignee: Slawek Kaplonski (slaweq)
         Status: Confirmed


** Tags: l3-ha

** Changed in: neutron
     Assignee: (unassigned) => Slawek Kaplonski (slaweq)

https://bugs.launchpad.net/bugs/1785582
