[Yahoo-eng-team] [Bug 1633306] Re: Partial HA network causing HA router creation failed (race conditon)
Temporarily closing this. I will test it on master and reopen the bug if I
encounter it there.

** Changed in: neutron
       Status: Incomplete => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1633306

Title:
  Partial HA network causing HA router creation failed (race conditon)

Status in neutron:
  Invalid

Bug description:
  ENV: stable/mitaka, VXLAN
  Neutron API: two neutron-servers behind an HA proxy VIP.

  Log:
  [1] http://paste.openstack.org/show/585670/

  Log [1] shows that the subnet of the HA network is deleted concurrently
  while a new HA router create API request comes in. It seems the race
  condition described in https://bugs.launchpad.net/neutron/+bug/1533440
  still exists. The description of that bug says:

  """
  Some known exceptions:
  ...
  2. IpAddressGenerationFailure: (HA port created failed due to the concurrently HA subnet deletion)
  ...
  """

  Strangely, those 3 API calls share the same request-id
  [req-780b1f6e-2b3c-4303-a1de-a5fb4c7ea31e].

  Test scenario:
  Just create one HA router for a tenant, and then quickly delete it.

  For now, our mitaka ENV uses VXLAN as the tenant network type, so there
  is a very large range of VNIs and no need to conserve them. As a local
  temporary solution, we added a new config option that decides whether
  to delete the HA network every time.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1633306/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
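The local workaround mentioned at the end of the bug description (a config option that decides whether the HA network is deleted) could look roughly like the sketch below. This is not the reporter's patch and not Neutron code: the function name and the `delete_unused_ha_network` flag are hypothetical, chosen only to show the decision being gated.

```python
def ha_network_survives_router_delete(remaining_ha_routers,
                                      delete_unused_ha_network=True):
    """Return True if the tenant's HA network is kept after a router delete.

    `delete_unused_ha_network` is a hypothetical config flag (an assumption,
    not a real Neutron option). With the flag on, the HA network is removed
    once the tenant has no HA routers left. With it off, the network is
    always kept, so a concurrent router create never finds it missing.
    """
    if remaining_ha_routers == 0 and delete_unused_ha_network:
        return False  # last HA router gone: the network is cleaned up
    return True       # routers remain, or cleanup is disabled


# Cleanup enabled: deleting the last router removes the network.
print(ha_network_survives_router_delete(0))
# Cleanup disabled: the network is kept, at the cost of holding one VNI.
print(ha_network_survives_router_delete(0, delete_unused_ha_network=False))
```

Keeping the network trades one permanently allocated VNI per tenant for the absence of the delete/create window, which is why the reporter points to the large VXLAN VNI range.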
[Yahoo-eng-team] [Bug 1633306] Re: Partial HA network causing HA router creation failed (race conditon)
Thanks John. This bug now covers only the one creation-failure log, which I
described in comment #2.

** Description changed:

  ENV: stable/mitaka, VXLAN
  Neutron API: two neutron-servers behind an HA proxy VIP.

- Exception log:
- [1] http://paste.openstack.org/show/585669/
- [2] http://paste.openstack.org/show/585670/
+ Log:
+ [1] http://paste.openstack.org/show/585670/

  Log [1] shows that the subnet of the HA network is deleted concurrently
  while a new HA router create API request comes in. It seems the race
  condition described in https://bugs.launchpad.net/neutron/+bug/1533440
  still exists. The description of that bug says:

  """
  Some known exceptions:
  ...
  2. IpAddressGenerationFailure: (HA port created failed due to the concurrently HA subnet deletion)
  ...
  """

- Log [2] shows a very strange behavior: those 3 API calls share the same
- request-id [req-780b1f6e-2b3c-4303-a1de-a5fb4c7ea31e].
+ Strangely, those 3 API calls share the same request-id
+ [req-780b1f6e-2b3c-4303-a1de-a5fb4c7ea31e].

  Test scenario:
  Just create one HA router for a tenant, and then quickly delete it.

  For now, our mitaka ENV uses VXLAN as the tenant network type, so there
  is a very large range of VNIs and no need to conserve them. As a local
  temporary solution, we added a new config option that decides whether
  to delete the HA network every time.

** Changed in: neutron
       Status: Invalid => New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1633306

Title:
  Partial HA network causing HA router creation failed (race conditon)

Status in neutron:
  New

Bug description:
  ENV: stable/mitaka, VXLAN
  Neutron API: two neutron-servers behind an HA proxy VIP.

  Log:
  [1] http://paste.openstack.org/show/585670/

  Log [1] shows that the subnet of the HA network is deleted concurrently
  while a new HA router create API request comes in. It seems the race
  condition described in https://bugs.launchpad.net/neutron/+bug/1533440
  still exists. The description of that bug says:

  """
  Some known exceptions:
  ...
  2. IpAddressGenerationFailure: (HA port created failed due to the concurrently HA subnet deletion)
  ...
  """

  Strangely, those 3 API calls share the same request-id
  [req-780b1f6e-2b3c-4303-a1de-a5fb4c7ea31e].

  Test scenario:
  Just create one HA router for a tenant, and then quickly delete it.

  For now, our mitaka ENV uses VXLAN as the tenant network type, so there
  is a very large range of VNIs and no need to conserve them. As a local
  temporary solution, we added a new config option that decides whether
  to delete the HA network every time.
[Yahoo-eng-team] [Bug 1633306] Re: Partial HA network causing HA router creation failed (race conditon)
Looking at the server log ([1], the same one you provided in the first
comment and in comment #3), and specifically at lines 19 and 21, it's clear
that sync_routers() is triggering auto_schedule_routers(). Before [2]
removed it, the call from sync_routers() to auto_schedule_routers() was
made on line 96 of neutron/api/rpc/handlers/l3_rpc.py, as can be seen in
the log:

  2016-10-09 17:03:52.366 144166 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/neutron/api/rpc/handlers/l3_rpc.py", line 96, in sync_routers
  2016-10-09 17:03:52.366 144166 ERROR oslo_messaging.rpc.dispatcher     self.l3plugin.auto_schedule_routers(context, host, router_ids)

In [2], it's evident that line 96 itself is removed. Thus, this can't be
reproduced in master or in stable/mitaka, and there is no (upstream) bug to
fix.

[1]: http://paste.openstack.org/show/585669/
[2]: https://github.com/openstack/neutron/commit/33650bf1d1994a96eff993af0bfdaa62588f08a4

** Changed in: neutron
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1633306

Title:
  Partial HA network causing HA router creation failed (race conditon)

Status in neutron:
  Invalid

Bug description:
  ENV: stable/mitaka, VXLAN
  Neutron API: two neutron-servers behind an HA proxy VIP.

  Exception log:
  [1] http://paste.openstack.org/show/585669/
  [2] http://paste.openstack.org/show/585670/

  Log [1] shows that the subnet of the HA network is deleted concurrently
  while a new HA router create API request comes in. It seems the race
  condition described in https://bugs.launchpad.net/neutron/+bug/1533440
  still exists. The description of that bug says:

  """
  Some known exceptions:
  ...
  2. IpAddressGenerationFailure: (HA port created failed due to the concurrently HA subnet deletion)
  ...
  """

  Log [2] shows a very strange behavior: those 3 API calls share the same
  request-id [req-780b1f6e-2b3c-4303-a1de-a5fb4c7ea31e].

  Test scenario:
  Just create one HA router for a tenant, and then quickly delete it.

  For now, our mitaka ENV uses VXLAN as the tenant network type, so there
  is a very large range of VNIs and no need to conserve them. As a local
  temporary solution, we added a new config option that decides whether
  to delete the HA network every time.
[Yahoo-eng-team] [Bug 1633306] Re: Partial HA network causing HA router creation failed (race conditon)
** Changed in: neutron
       Status: Invalid => New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1633306

Title:
  Partial HA network causing HA router creation failed (race conditon)

Status in neutron:
  New

Bug description:
  ENV: stable/mitaka, VXLAN
  Neutron API: two neutron-servers behind an HA proxy VIP.

  Exception log:
  [1] http://paste.openstack.org/show/585669/
  [2] http://paste.openstack.org/show/585670/

  Log [1] shows that the subnet of the HA network is deleted concurrently
  while a new HA router create API request comes in. It seems the race
  condition described in https://bugs.launchpad.net/neutron/+bug/1533440
  still exists. The description of that bug says:

  """
  Some known exceptions:
  ...
  2. IpAddressGenerationFailure: (HA port created failed due to the concurrently HA subnet deletion)
  ...
  """

  Log [2] shows a very strange behavior: those 3 API calls share the same
  request-id [req-780b1f6e-2b3c-4303-a1de-a5fb4c7ea31e].

  Test scenario:
  Just create one HA router for a tenant, and then quickly delete it.

  For now, our mitaka ENV uses VXLAN as the tenant network type, so there
  is a very large range of VNIs and no need to conserve them. As a
  temporary solution, we added a new config option that decides whether
  to delete the HA network every time.
[Yahoo-eng-team] [Bug 1633306] Re: Partial HA network causing HA router creation failed (race conditon)
Adding a new configuration option is almost never temporary, as deleting
config options is rarely backward-compatible.

The race condition, as I understand it, is as follows:

1. Create HA router, have worker1 send 'router_updated' to agent1.
2. Delete HA router (done by worker2). worker2 will now detect that there
   are no more HA routers and will delete the HA network for the tenant.
3. agent1 issues a 'sync_router', which triggers auto_schedule_routers.
   create_ha_port_and_bind will try to create the HA port but there are no
   more IP addresses available, causing add_ha_port to fail as specified
   in the first paste.

Point #3 is a bit weird to me, as it looks like IPAM is detecting a "network
deleted during function run" as "no more IP addresses". In addition, this
should be caught by [2], forcing a silent retrigger of this issue.

Aside from the issue that isn't clear to me, I'd like to point out that the
latest stable/mitaka [1] doesn't even trigger auto_schedule_routers on
sync_router (not since [3]; perhaps you're missing this backport?), hence
the trace in the first paste can't be reproduced.

For this reason, I'm closing this as Invalid. Liu, feel free to reopen if
you disagree with my assessment :)

[1]: https://github.com/openstack/neutron/blob/5860fb21e966ab8f1e011654dd477d7af35f7a27/neutron/api/rpc/handlers/l3_rpc.py#L79
[2]: https://github.com/openstack/neutron/blob/5860fb21e966ab8f1e011654dd477d7af35f7a27/neutron/common/utils.py#L726
[3]: https://github.com/openstack/neutron/commit/33650bf1d1994a96eff993af0bfdaa62588f08a4

(5860fb21e966ab8f1e011654dd477d7af35f7a27 is the latest stable/mitaka hash
that github.com provided.)

** Changed in: neutron
   Importance: High => Undecided

** Changed in: neutron
       Status: Confirmed => Invalid

** Changed in: neutron
    Milestone: ocata-1 => None

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1633306

Title:
  Partial HA network causing HA router creation failed (race conditon)

Status in neutron:
  Invalid

Bug description:
  ENV: stable/mitaka, VXLAN
  Neutron API: two neutron-servers behind an HA proxy VIP.

  Exception log:
  [1] http://paste.openstack.org/show/585669/
  [2] http://paste.openstack.org/show/585670/

  Log [1] shows that the subnet of the HA network is deleted concurrently
  while a new HA router create API request comes in. It seems the race
  condition described in https://bugs.launchpad.net/neutron/+bug/1533440
  still exists. The description of that bug says:

  """
  Some known exceptions:
  ...
  2. IpAddressGenerationFailure: (HA port created failed due to the concurrently HA subnet deletion)
  ...
  """

  Log [2] shows a very strange behavior: those 3 API calls share the same
  request-id [req-780b1f6e-2b3c-4303-a1de-a5fb4c7ea31e].

  Test scenario:
  Just create one HA router for a tenant, and then quickly delete it.

  For now, our mitaka ENV uses VXLAN as the tenant network type, so there
  is a very large range of VNIs and no need to conserve them. As a
  temporary solution, we added a new config option that decides whether
  to delete the HA network every time.
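The three-step race described in the comment above can be replayed deterministically with a toy in-memory model. This is only a sketch under stated assumptions: `ToyL3Plugin`, its methods, the tenant and router names, and the two sample addresses are all hypothetical stand-ins, and the exception class is a local one that merely borrows the name quoted in the bug description. It does not call any Neutron code.

```python
class IpAddressGenerationFailure(Exception):
    """Local stand-in named after the exception quoted in the bug report."""


class ToyL3Plugin:
    """Hypothetical in-memory model of the per-tenant HA network state."""

    def __init__(self):
        self.ha_routers = {}     # tenant -> set of HA router ids
        self.ha_subnet_ips = {}  # tenant -> free IPs on the HA subnet

    def create_ha_router(self, tenant, router_id):
        # Step 1: the first HA router of a tenant also creates the HA network.
        self.ha_subnet_ips.setdefault(
            tenant, ["169.254.192.1", "169.254.192.2"])  # sample addresses
        self.ha_routers.setdefault(tenant, set()).add(router_id)

    def delete_ha_router(self, tenant, router_id):
        # Step 2: deleting the last HA router removes the HA network too.
        routers = self.ha_routers.get(tenant, set())
        routers.discard(router_id)
        if not routers:
            self.ha_subnet_ips.pop(tenant, None)

    def sync_routers(self, tenant, router_id):
        # Step 3: models the old path where the agent's sync ends up
        # creating an HA port, which needs an IP from the HA subnet.
        free_ips = self.ha_subnet_ips.get(tenant, [])
        if not free_ips:
            raise IpAddressGenerationFailure(
                "no HA subnet address left for tenant %s" % tenant)
        return free_ips.pop(0)


def replay(delete_before_sync):
    """Run create, optionally delete, then sync; report what sync saw."""
    plugin = ToyL3Plugin()
    plugin.create_ha_router("tenant-a", "router-1")
    if delete_before_sync:
        plugin.delete_ha_router("tenant-a", "router-1")
    try:
        return plugin.sync_routers("tenant-a", "router-1")
    except IpAddressGenerationFailure as exc:
        return type(exc).__name__


print(replay(delete_before_sync=False))
print(replay(delete_before_sync=True))
```

In this model the missing subnet surfaces as an address-allocation failure, which mirrors the point the comment finds odd: a network deleted mid-operation being reported as "no more IP addresses".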