[Yahoo-eng-team] [Bug 1943714] Re: DB session commit error in resource_registry.set_resources_dirty
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/809983
Committed: https://opendev.org/openstack/neutron/commit/603abeb977d4018963beade5c858b53f990ef32a
Submitter: "Zuul (22348)"
Branch:    master

commit 603abeb977d4018963beade5c858b53f990ef32a
Author: Rodolfo Alonso Hernandez
Date:   Mon Sep 20 09:27:39 2021 +

    Execute the quota reservation removal in an isolated DB txn

    The goal of [1] is to continue the operation even if removing the
    quota reservation fails; any expired reservation is removed
    automatically by every driver. If the DB transaction fails, it
    should affect only the reservation being deleted. This is why this
    patch isolates the "remove_reservation" method and guarantees it is
    called outside an active DB session, so that in case of failure no
    other DB operation is affected.

    This patch also partially reverts [2] but still checks the security
    group rule quota when a new security group is created. Instead of
    creating and releasing a quota reservation for the security group
    rules created, now only the available quota limit is checked before
    creating them. That won't prevent another operation from creating
    security group rules in parallel and exceeding the available quota;
    however, this is not guaranteed with the current quota driver
    either.

    [1] https://review.opendev.org/c/openstack/neutron/+/805031
    [2] https://review.opendev.org/c/openstack/neutron/+/701565

    Closes-Bug: #1943714
    Change-Id: Id73368576a948f78a043d7cf0be16661a65626a9

** Changed in: neutron
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1943714

Title:
  DB session commit error in resource_registry.set_resources_dirty

Status in neutron:
  Fix Released

Bug description:
  It seems that patch
  https://review.opendev.org/c/openstack/neutron/+/805031 introduced a
  new error during the call to resource_registry.set_resources_dirty()
  in
  https://github.com/openstack/neutron/blob/6db261962894b1667dd213b116e89246a3e54386/neutron/api/v2/base.py#L506

  I didn't see that issue in our CI jobs on the master branch, but we
  noticed it in the downstream (d/s) jobs on OSP-16, which is based on
  Train. The error looks like:

  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource     six.reraise(self.type_, self.value, self.tb) [731/1883]
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource   File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource     raise value
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource   File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 142, in wrapper
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource     return f(*args, **kwargs)
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource   File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 183, in wrapped
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource     LOG.debug("Retry wrapper got retriable exception: %s", e)
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource     self.force_reraise()
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource   File "/usr/lib/
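The core idea of the fix above — run the reservation cleanup in its own transaction, so a failure there cannot poison the caller's session — can be sketched with a plain sqlite3 example. The table names and functions here are illustrative stand-ins, not neutron's actual code:

```python
import sqlite3


def remove_reservation(db_path, reservation_id):
    """Delete one reservation in its own connection/transaction.

    Any failure here is contained: it cannot roll back work already
    committed by the caller, mirroring the patch's goal of calling
    "remove_reservation" outside any active DB session.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "DELETE FROM reservations WHERE id = ?", (reservation_id,))
    except sqlite3.Error:
        # Swallow the error: expired reservations are reaped later anyway.
        pass
    finally:
        conn.close()


def create_resource(db_path, name, reservation_id):
    """Do the 'real' work, then release the reservation separately."""
    conn = sqlite3.connect(db_path)
    with conn:  # this commit is independent of the cleanup below
        conn.execute("INSERT INTO resources (name) VALUES (?)", (name,))
    conn.close()
    # Cleanup runs afterwards, in an isolated transaction.
    remove_reservation(db_path, reservation_id)
```

Even if the reservations table is broken or the row is already gone, `create_resource` still succeeds, which is exactly the failure isolation the commit message describes.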
[Yahoo-eng-team] [Bug 1935842] Re: [OVN] Set inactivity_probe for NB/SB databases
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/800521
Committed: https://opendev.org/openstack/neutron/commit/28f3017a90ecec6208aef9696cd7947972ec17d8
Submitter: "Zuul (22348)"
Branch:    master

commit 28f3017a90ecec6208aef9696cd7947972ec17d8
Author: Rodolfo Alonso Hernandez
Date:   Mon Jul 12 15:30:47 2021 +

    [OVN] Set NB/SB "connection" inactivity probe

    For each of the NB/SB databases, set the inactivity probe timeout
    value on the Neutron-configured connection register. Neutron
    connects to the NB/SB databases with the value defined in the
    "ovn_nb_connection"/"ovn_sb_connection" config parameters, with the
    format:
    - tcp:IP:PORT
    - ssl:IP:PORT
    - unix:socket_file

    The connection register "target" column has the following format:
    - ptcp:PORT:IP
    - pssl:PORT:IP
    - punix:socket_file

    Closes-Bug: #1935842
    Change-Id: If82ff2c92b8bec18ae3e895ce19a65cfeed9c90d

** Changed in: neutron
       Status: In Progress => Fix Released

https://bugs.launchpad.net/bugs/1935842

Title:
  [OVN] Set inactivity_probe for NB/SB databases

Status in neutron:
  Fix Released

Bug description:
  The 'inactivity_probe' parameter in the OVN DB schemas [0][1]
  represents the maximum number of milliseconds of idle time on a
  connection to a client before sending an inactivity probe message.
  This parameter is never set by Neutron, and it is important to allow
  its configuration, since at scale or under heavy load it can be
  crucial to avoid reconnections.
  [0] https://github.com/ovn-org/ovn/blob/95e41218fd7ea9c288c95744e80ba7caa90b30e2/ovn-sb.ovsschema#L275
  [1] https://github.com/ovn-org/ovn/blob/95e41218fd7ea9c288c95744e80ba7caa90b30e2/ovn-nb.ovsschema#L456

  Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1979231

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1935842/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
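The commit message above lists two address formats: the connection URI Neutron is configured with, and the "target" column format of the OVSDB Connection register. Translating between them is a small string transformation, sketched below. The helper name and the bracketed-IPv6 handling are assumptions for illustration, not neutron's actual code:

```python
def connection_to_target(connection):
    """Convert a configured OVN connection URI to OVSDB "target" form.

    tcp:IP:PORT  -> ptcp:PORT:IP
    ssl:IP:PORT  -> pssl:PORT:IP
    unix:socket  -> punix:socket

    IPv6 addresses are assumed to be bracketed, e.g. "tcp:[::1]:6641",
    so splitting on the last colon still isolates the port.
    """
    scheme, _, rest = connection.partition(":")
    if scheme == "unix":
        return "punix:" + rest
    if scheme not in ("tcp", "ssl"):
        raise ValueError("unsupported scheme: %s" % scheme)
    # The port is the last colon-separated field; everything before it
    # is the address.
    addr, _, port = rest.rpartition(":")
    return "p%s:%s:%s" % (scheme, port, addr)
```

With the target string in hand, the fix then sets `inactivity_probe` on the matching Connection row.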
[Yahoo-eng-team] [Bug 1937084] Re: Nova thinks deleted volume is still attached
https://review.opendev.org/c/openstack/nova/+/812127 - Nova will
address the 404 from c-api during an attachment delete here.

** Also affects: nova
   Importance: Undecided
       Status: New

** Changed in: nova
       Status: New => In Progress

** Changed in: nova
   Importance: Undecided => Medium

** Changed in: nova
     Assignee: (unassigned) => Lee Yarwood (lyarwood)

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).

https://bugs.launchpad.net/bugs/1937084

Title:
  Nova thinks deleted volume is still attached

Status in Cinder:
  In Progress
Status in OpenStack Compute (nova):
  In Progress

Bug description:
  There are cases where a Cinder volume no longer exists, yet Nova
  still thinks it is attached to an instance, and we cannot detach it
  anymore.

  This has been observed when running cinder-csi, which makes a volume
  delete request as soon as the volume status says it's available.

  This is a Cinder race condition and, like most race conditions, it is
  not simple to explain. Some context on the issue:

  - Cinder API uses the volume "status" field as a locking mechanism to
    prevent concurrent request processing on the same volume.
  - Most Cinder operations are asynchronous, so the API returns before
    the operation has been completed by the cinder-volume service, but
    the attachment operations (creating/updating/deleting an
    attachment) are synchronous, so the API only returns to the caller
    after the cinder-volume service has completed the operation.
  - Our current code **incorrectly** modifies the status of the volume
    on both the cinder-volume and the cinder-api services during the
    attachment delete operation.
  The actual set of events that leads to the issue reported in this
  bug is:

  [Cinder-CSI]
  - Requests Nova to detach the volume (Request R1)

  [Nova]
  - R1: Asks cinder-api to delete the attachment and **waits**

  [Cinder-API]
  - R1: Checks the status of the volume
  - R1: Sends the terminate connection request (R1) to cinder-volume
    and **waits**

  [Cinder-Volume]
  - R1: Asks the driver to terminate the connection
  - R1: The driver asks the backend to unmap and unexport the volume
  - R1: The status of the volume is changed in the DB to "available"

  [Cinder-CSI]
  - Asks Cinder to delete the volume (Request R2)

  [Cinder-API]
  - R2: Checks that the volume's status is valid. It's "available", so
    it can be deleted.
  - R2: Tells cinder-volume to delete the volume and returns
    immediately.

  [Cinder-Volume]
  - R2: The volume is deleted and the DB entry is deleted
  - R1: Finishes the termination of the connection

  [Cinder-API]
  - R1: Now that cinder-volume has finished the termination, the code
    continues
  - R1: Tries to modify the volume in the DB
  - R1: The DB layer raises VolumeNotFound, since the volume has been
    deleted from the DB
  - R1: VolumeNotFound is converted to an HTTP 404 status code, which
    is returned to Nova

  [Nova]
  - R1: Cinder responds with 404 on the attachment delete request
  - R1: Nova leaves the volume as attached, since the attachment
    delete failed

  At this point the Cinder and Nova DBs are out of sync, because Nova
  thinks that the attachment is connected while Cinder has detached
  the volume and even deleted it.

  **This is caused by a Cinder bug**, but there is some robustification
  work that could be done in Nova: the volume could be left in a
  "detached from instance" state (since the os-brick call succeeded),
  and a second detach request could skip the os-brick call; when Nova
  sees that the volume or the attachment no longer exists in Cinder,
  it can proceed to remove it from the instance's XML.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/1937084/+subscriptions
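The robustification suggested for Nova amounts to treating a 404 on attachment delete as "already detached" instead of a hard failure. A minimal sketch, using a stand-in exception class and a duck-typed client object (these names are illustrative, not the real nova/cinderclient API):

```python
class NotFound(Exception):
    """Stand-in for a 404 from cinder-api on attachment delete."""


def safe_delete_attachment(cinder, attachment_id):
    """Delete a volume attachment, tolerating 'already gone'.

    `cinder` is any object with a delete_attachment(id) method. If
    Cinder reports the attachment (or volume) no longer exists -- the
    race described in the bug -- deletion is treated as idempotent, so
    the caller can continue detaching the volume from the instance
    instead of leaving Nova's records pointing at a deleted volume.
    """
    try:
        cinder.delete_attachment(attachment_id)
    except NotFound:
        # The volume/attachment was deleted out from under us; it is
        # already in the state we wanted, so proceed with local cleanup.
        pass
    return True
```

Any other exception still propagates, so only the specific "resource no longer exists" case is swallowed.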
[Yahoo-eng-team] [Bug 1813712] Re: [L2][scale issue] ovs-agent has multiple cookies flows (stale flows)
** Changed in: neutron
       Status: New => Fix Released

https://bugs.launchpad.net/bugs/1813712

Title:
  [L2][scale issue] ovs-agent has multiple cookies flows (stale flows)

Status in neutron:
  Fix Released

Bug description:
  When the subnet or security group port count reaches 2000+, there
  are many stale flows. The basic failure procedure is:

  (1) ovs-agent dumps flows
  (2) ovs-agent deletes some flows
  (3) ovs-agent installs new flows (with new cookies)
  (4) an exception is raised in (2) or (3), such as bug #1813705
  (5) ovs-agent does a full sync again, then goes back to (1)

  Eventually many stale flows are left installed, which sometimes
  takes the data plane down.

  This is a subproblem of bug #1813703; for more information, please
  see the summary: https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813712/+subscriptions
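The stale-flow problem above stems from each agent run stamping its flows with a fresh cookie: once a sync completes, any flow still carrying an old cookie is by definition stale and can be removed in a single pass, rather than relying on the dump/delete/reinstall cycle that leaves leftovers whenever step (2) or (3) fails. A sketch of that selection step (the flow dictionaries here are an illustrative representation, not the agent's real data structures):

```python
def stale_flows(flows, current_cookie):
    """Return the flows whose cookie differs from the agent's current one.

    After a successful full sync, every live flow carries
    `current_cookie`; anything else was installed by a previous
    (possibly failed) run and is safe to delete.
    """
    return [f for f in flows if f["cookie"] != current_cookie]


def reconcile(flows, current_cookie):
    """Split a flow dump into (keep, delete) lists in one pass."""
    keep, delete = [], []
    for f in flows:
        (keep if f["cookie"] == current_cookie else delete).append(f)
    return keep, delete
```

A cleanup based on this check is idempotent: re-running it after a partial failure removes only what is still stale, instead of accumulating more leftover flows.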
[Yahoo-eng-team] [Bug 1813704] Re: [L2][scale issue] RPC timeout during ovs-agent restart
** Changed in: neutron
       Status: Confirmed => Fix Released

https://bugs.launchpad.net/bugs/1813704

Title:
  [L2][scale issue] RPC timeout during ovs-agent restart

Status in neutron:
  Fix Released

Bug description:
  When the port count under one subnet or security group reaches
  2000+, the ovs-agent will always hit an RPC timeout during restart.

  This is a subproblem of bug #1813703; for more information, please
  see the summary: https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813704/+subscriptions
[Yahoo-eng-team] [Bug 1813705] Re: [L2][scale issue] local connection to ovs-vswitchd was dropped or timed out
** Changed in: neutron
       Status: New => Fix Released

https://bugs.launchpad.net/bugs/1813705

Title:
  [L2][scale issue] local connection to ovs-vswitchd was dropped or
  timed out

Status in neutron:
  Fix Released

Bug description:
  When the subnet or security group port count reaches 2000+, the
  ovs-agent connection to ovs-vswitchd may get lost, dropped, or time
  out during restart.

  This is a subproblem of bug #1813703; for more information, please
  see the summary: https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813705/+subscriptions
[Yahoo-eng-team] [Bug 1813707] Re: [L2][scale issue] ovs-agent restart takes too long
** Changed in: neutron
       Status: New => Fix Released

https://bugs.launchpad.net/bugs/1813707

Title:
  [L2][scale issue] ovs-agent restart takes too long

Status in neutron:
  Fix Released

Bug description:
  When the subnet or security group port count reaches 2000+, the
  ovs-agent takes 15-40+ minutes to restart. During this restart time,
  the agent will not process any port, i.e. a VM booting on this host
  will not get its L2 flows established.

  This is a subproblem of bug #1813703; for more information, please
  see the summary: https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813707/+subscriptions
[Yahoo-eng-team] [Bug 1813706] Re: [L2][scale issue] ovs-agent failed to restart
** Changed in: neutron
       Status: New => Fix Released

https://bugs.launchpad.net/bugs/1813706

Title:
  [L2][scale issue] ovs-agent failed to restart

Status in neutron:
  Fix Released

Bug description:
  When the subnet or security group port count reaches 2000+, the
  ovs-agent fails to restart and does full sync infinitely.

  This is a subproblem of bug #1813703; for more information, please
  see the summary: https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813706/+subscriptions
[Yahoo-eng-team] [Bug 1813708] Re: [L2][scale issue] ovs-agent has too many flows for troubleshooting
** Changed in: neutron
       Status: New => Fix Released

https://bugs.launchpad.net/bugs/1813708

Title:
  [L2][scale issue] ovs-agent has too many flows for troubleshooting

Status in neutron:
  Fix Released

Bug description:
  When the subnet or security group port count reaches 2000+, it is
  really hard to troubleshoot when one VM loses its connection. The
  flow tables are almost unreadable (reaching 30k+ flows). We have no
  way to check the ovs-agent flow status, and restarting the L2 agent
  does not help anymore, since we have so many issues at scale.

  This is a subproblem of bug #1813703; for more information, please
  see the summary: https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813708/+subscriptions
[Yahoo-eng-team] [Bug 1813709] Re: [L2][scale issue] ovs-agent dump-flows takes a lot of time
** Changed in: neutron
       Status: New => Fix Released

https://bugs.launchpad.net/bugs/1813709

Title:
  [L2][scale issue] ovs-agent dump-flows takes a lot of time

Status in neutron:
  Fix Released

Bug description:
  The ovs-agent's stale flow cleanup first dumps all the bridge flows.
  When the subnet or security group port count reaches 2000+, this
  becomes really time-consuming. Sometimes this dump action also
  fails, so the ovs-agent dumps again, and things get worse.

  This is a subproblem of bug #1813703; for more information, please
  see the summary: https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813709/+subscriptions
[Yahoo-eng-team] [Bug 1945215] Re: "process_floating_ip_nat_rules_for_centralized_floatingip" should check if self.snat_iptables_manager was initialized
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/811318
Committed: https://opendev.org/openstack/neutron/commit/f18edfdf450179f6bc8a47f3b143f2701bd93e0e
Submitter: "Zuul (22348)"
Branch:    master

commit f18edfdf450179f6bc8a47f3b143f2701bd93e0e
Author: Rodolfo Alonso Hernandez
Date:   Mon Sep 27 16:22:45 2021 +

    [DVR] Check if SNAT iptables manager is initialized

    Check if the SNAT iptables manager is initialized before processing
    the IP NAT rules. If the router never had an external GW port, the
    DVR GW in the SNAT namespace has not been created and the SNAT
    iptables manager has not been initialized. In this case, the IP NAT
    rules for centralized FIPs (to be applied in the SNAT namespace)
    cannot be set.

    Closes-Bug: #1945215
    Change-Id: I426602514805d728f8cd78e42f2b0979b2101089

** Changed in: neutron
       Status: In Progress => Fix Released

https://bugs.launchpad.net/bugs/1945215

Title:
  "process_floating_ip_nat_rules_for_centralized_floatingip" should
  check if self.snat_iptables_manager was initialized

Status in neutron:
  Fix Released

Bug description:
  Environment: L3 agent configuration with agent_mode=dvr_snat. The L3
  agent is located on a controller node, acting as a DVR edge router
  (no HA).

  Description: When
  "process_floating_ip_nat_rules_for_centralized_floatingip" is
  called, this method should first check if
  "self.snat_iptables_manager" has been initialized.

  The method is called from:
  <-- DvrEdgeRouter.process_floating_ip_nat_rules
  <-- RouterInfo.process_snat_dnat_for_fip
  <-- RouterInfo.process_external

  "RouterInfo.process_external" will first call
  "RouterInfo._process_external_gateway" -->
  "DvrEdgeRouter.external_gateway_added" -->
  "DvrEdgeRouter._create_dvr_gateway". This last method initializes
  the SNAT iptables manager [1] (this code has been around unchanged
  for six years).
  However, "DvrEdgeRouter.external_gateway_added" is only called if
  "ex_gw_port" exists. That means that if the GW port does not exist,
  the SNAT iptables manager is None.

  Error example (snippet): https://paste.opendev.org/show/809621/

  This bug is similar to
  https://bugs.launchpad.net/neutron/+bug/1560945 (related patch:
  https://review.opendev.org/c/openstack/neutron/+/296394).

  Steps to reproduce (I'm not 100% sure, I still need to check):
  create a FIP on a SNAT DVR router without a GW port.

  Bugzilla reference: https://bugzilla.redhat.com/show_bug.cgi?id=2008155

  [1] https://github.com/openstack/neutron/blob/1d450dbddc8c3d34948ab3d9a8346dd491d9cc7c/neutron/agent/l3/dvr_edge_router.py#L196-L198

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1945215/+subscriptions
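The fix described above boils down to a None check before dereferencing the manager. A simplified stand-in class (not the real DvrEdgeRouter, just an illustration of the guard):

```python
class DvrEdgeRouterSketch:
    """Minimal stand-in for neutron's DvrEdgeRouter (illustrative)."""

    def __init__(self):
        # In the real agent, only _create_dvr_gateway() sets this, and
        # that only runs when the router has an external GW port.
        self.snat_iptables_manager = None
        self.rules_applied = False

    def process_floating_ip_nat_rules_for_centralized_floatingip(self):
        # The fix: bail out if the manager was never initialized,
        # instead of raising AttributeError/TypeError on None.
        if self.snat_iptables_manager is None:
            return False
        # ... in the real code, the IP NAT rules for centralized FIPs
        # would be programmed into the SNAT namespace here ...
        self.rules_applied = True
        return True
```

Routers that never had a gateway port simply skip the NAT rule processing; once the gateway (and the manager) exists, the rules are applied as before.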
[Yahoo-eng-team] [Bug 1945747] [NEW] GET security group rule is missing description attribute
Public bug reported:

The description attribute is missing from
_make_security_group_rule_dict.

Create a security group rule with a description:

stack@bionic-template:~/devstack$ openstack security group rule create --description "test rule" --remote-ip 0.0.0.0/0 --ingress ff57f76f-93a0-4bf3-b538-c88df40fdc40
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| created_at        | 2021-10-01T06:35:50Z                 |
| description       | test rule                            |
| direction         | ingress                              |
| ether_type        | IPv4                                 |
| id                | 389eb45e-58ac-471c-b966-a3c8784009f7 |
| location          | cloud='', project.domain_id='default', project.domain_name=, project.id='f2527eb734c745eca32b1dfbd9107563', project.name='admin', region_name='RegionOne', zone= |
| name              | None                                 |
| port_range_max    | None                                 |
| port_range_min    | None                                 |
| project_id        | f2527eb734c745eca32b1dfbd9107563     |
| protocol          | None                                 |
| remote_group_id   | None                                 |
| remote_ip_prefix  | None                                 |
| revision_number   | 0                                    |
| security_group_id | ff57f76f-93a0-4bf3-b538-c88df40fdc40 |
| tags              | []                                   |
| updated_at        | 2021-10-01T06:35:50Z                 |
+-------------------+--------------------------------------+

Example GET response (no description):

RESP BODY: {"security_group_rule": {"id": "389eb45e-58ac-471c-b966-a3c8784009f7", "tenant_id": "f2527eb734c745eca32b1dfbd9107563", "security_group_id": "ff57f76f-93a0-4bf3-b538-c88df40fdc40", "ethertype": "IPv4", "direction": "ingress", "protocol": null, "port_range_min": null, "port_range_max": null, "remote_ip_prefix": "0.0.0.0/0", "remote_group_id": null, "local_ip_prefix": null, "created_at": "2021-10-01T06:35:50Z", "updated_at": "2021-10-01T06:35:50Z", "revision_number": 0, "project_id": "f2527eb734c745eca32b1dfbd9107563"}}

Potential fix (patch applies to stable/ussuri, not master):

diff --git a/neutron/db/securitygroups_db.py b/neutron/db/securitygroups_db.py
index 28238358ae..0c848bbe38 100644
--- a/neutron/db/securitygroups
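The gist of the (truncated) patch above is to emit a 'description' key from the dict helper so GET responses include it. An illustrative sketch of that idea (a trimmed-down stand-in, not the exact neutron `_make_security_group_rule_dict` code):

```python
def make_security_group_rule_dict_sketch(rule_db, fields=None):
    """Build the API dict for a security group rule (simplified).

    `rule_db` is any mapping with the rule's DB columns. The bug is
    that the real helper omitted the 'description' key, so the value
    stored at create time never appeared in GET responses; adding the
    key here is the whole fix.
    """
    res = {
        'id': rule_db['id'],
        'security_group_id': rule_db['security_group_id'],
        'direction': rule_db['direction'],
        'protocol': rule_db['protocol'],
        # The missing attribute reported in this bug:
        'description': rule_db.get('description'),
    }
    if fields:
        # Mimic the usual fields-filtering behaviour of these helpers.
        res = {k: v for k, v in res.items() if k in fields}
    return res
```

With the key present, the GET response shown above would carry `"description": "test rule"` alongside the other attributes.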