Public bug reported:

We've been debugging some issues seen lately [0] and found out that
there's a bug in the l3 agent when creating routers (or during its
initial sync). Jakub Libosvar and I spent some time reproducing the
issue and this is what we found:

Especially since we bumped to ovsdbapp 0.8.0, we've seen some jobs
failing due to errors when authenticating to a VM with a public key.
The TCP connection to the SSH port was established successfully but the
authentication failed. After debugging further, we found that the
metadata iptables rules in the qrouter namespace, which redirect traffic
to haproxy (which replaced the old neutron-ns-metadata-proxy), were
missing, so VMs weren't fetching metadata (and hence the public key).
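
For reference, the kind of rule that was missing looks roughly like this
in a healthy qrouter namespace (interface prefix and ports may vary with
configuration):

    ip netns exec qrouter-<router-uuid> iptables -t nat -S | grep 169.254
    -A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697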

These rules are installed by the metadata driver after a router is created [1],
on the AFTER_CREATE notification. They also get created during the initial
sync of the l3 agent, since the router is still unknown to the agent at that
point [2]. There, if we don't know the router yet we call
_process_added_router(), and if it is an already known router we call
_process_updated_router().

After our tests, we've seen that the iptables rules are never restored if we
simulate an exception inside ri.process() at [3], even though the router is
scheduled for resync [4]. This happens because we have already added the
router to our router_info [5], so even though ri.process() fails at [3] and
the router is scheduled for resync, the next time around
_process_updated_router() will be called instead of _process_added_router().
The AFTER_CREATE notification is therefore never pushed to the metadata
driver, so the iptables rules never get installed (see the sketch below).
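
The following self-contained sketch (not the actual neutron code; the names
only mirror the l3 agent code in [2]-[5]) illustrates why the resync never
recovers the metadata rules:

    class FakeL3Agent(object):
        def __init__(self):
            self.router_info = {}        # routers already known to the agent [5]
            self.metadata_rules = set()  # stands in for the metadata iptables rules

        def _process_router_if_compatible(self, router):
            if router['id'] not in self.router_info:
                self._process_added_router(router)
            else:
                self._process_updated_router(router)

        def _process_added_router(self, router):
            # The router is registered before it is fully processed [5].
            self.router_info[router['id']] = router
            self._ri_process(router)                 # may raise, see [3]
            # AFTER_CREATE -> metadata driver installs the rules [1]
            self.metadata_rules.add(router['id'])

        def _process_updated_router(self, router):
            self._ri_process(router)                 # AFTER_UPDATE only, no rules

        def _ri_process(self, router):
            # Stand-in for ri.process(); fails once to simulate [3].
            if router.get('fail_once'):
                router['fail_once'] = False
                raise RuntimeError('simulated failure in ri.process()')

    agent = FakeL3Agent()
    router = {'id': 'r1', 'fail_once': True}
    try:
        agent._process_router_if_compatible(router)  # first pass fails
    except RuntimeError:
        pass                                         # router scheduled for resync [4]
    agent._process_router_if_compatible(router)      # resync takes the "updated" path
    print(agent.metadata_rules)                      # -> set(), rules never installed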

In conclusion, if an error occurs during _process_added_router() we may lose
metadata for that router until the agent is restarted and the call succeeds.
Worse, the metadata requests will be forwarded via br-ex, which could lead to
security issues (i.e. wrong metadata could be injected from the outside, or
the metadata server running in the underlying cloud may respond).

With ovsdbapp 0.9.0 we're minimizing this, because if a port fails to be added
to br-int, ovsdbapp will enqueue the transaction instead of raising an
exception. However, other exceptions outside of ovsdbapp could still reproduce
this scenario, so it needs to be fixed in Neutron; one possible approach is
sketched below.
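
Building on the sketch above, one possible approach (just a sketch, not a
merged patch; the real fix may need to handle partially created resources)
would be to forget the router again when the initial processing fails, so
that the resync goes through the "added" path and re-emits AFTER_CREATE:

    def _process_added_router(self, router):
        self.router_info[router['id']] = router
        try:
            self._ri_process(router)
        except Exception:
            # Hypothetical rollback: drop the router from router_info so the
            # next sync treats it as new again and re-sends AFTER_CREATE.
            self.router_info.pop(router['id'], None)
            raise
        self.metadata_rules.add(router['id'])

With this change, rerunning the two passes from the sketch above installs the
metadata rules on the resync pass.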

Thanks
Daniel Alvarez

---

[0] https://bugs.launchpad.net/tripleo/+bug/1731063
[1] https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/metadata/driver.py#L288
[2] https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/l3/agent.py#L472
[3] https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/l3/agent.py#L481
[4] https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/l3/agent.py#L565
[5] https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/l3/agent.py#L478

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1735724

Title:
  Metadata iptables rules never inserted upon exception on router
  creation

Status in neutron:
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1735724/+subscriptions
