[Yahoo-eng-team] [Bug 1943714] Re: DB session commit error in resource_registry.set_resources_dirty

2021-10-01 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/809983
Committed: 
https://opendev.org/openstack/neutron/commit/603abeb977d4018963beade5c858b53f990ef32a
Submitter: "Zuul (22348)"
Branch: master

commit 603abeb977d4018963beade5c858b53f990ef32a
Author: Rodolfo Alonso Hernandez 
Date:   Mon Sep 20 09:27:39 2021 +

Execute the quota reservation removal in an isolated DB txn

The goal of [1] is to continue the operation if removing the quota
reservation fails. Any expired reservation will be removed
automatically by any driver.

If the DB transaction fails, it should affect only the reservation
being deleted. That is why this patch isolates the
"remove_reservation" method and guarantees it is called outside an
active DB session, ensuring that, in case of failure, no other DB
operation is affected.

This patch also partially reverts [2] but still checks the security
group rule quota when a new security group is created. Instead of
creating and releasing a quota reservation for the security group
rules created, now only the available quota limit is checked before
creating them. That won't prevent another operation from creating
security group rules in parallel and exceeding the available quota;
however, the current quota driver does not guarantee that either.

[1]https://review.opendev.org/c/openstack/neutron/+/805031
[2]https://review.opendev.org/c/openstack/neutron/+/701565

Closes-Bug: #1943714

Change-Id: Id73368576a948f78a043d7cf0be16661a65626a9
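
A minimal illustrative sketch of the pattern described above, not the actual
patch: the reservation delete runs in its own writer transaction (assuming
neutron_lib's CONTEXT_WRITER; the "_delete_reservation_row" helper is
hypothetical) and any failure is logged and swallowed, so the caller's
operation continues and no other DB work is affected.

    # Sketch only -- not the Neutron patch referenced above.
    from neutron_lib.db import api as db_api
    from oslo_log import log as logging

    LOG = logging.getLogger(__name__)

    def remove_reservation_isolated(context, reservation_id):
        """Delete a quota reservation in a DB transaction of its own.

        Called outside any active DB session (as the commit message
        requires), so a failure here affects only the reservation delete.
        """
        try:
            with db_api.CONTEXT_WRITER.using(context):
                # Hypothetical helper standing in for the real delete query.
                _delete_reservation_row(context, reservation_id)
        except Exception as exc:
            # Expired reservations are cleaned up later by the quota
            # driver, so the failure is logged and the operation goes on.
            LOG.warning("Could not remove reservation %s: %s",
                        reservation_id, exc)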


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1943714

Title:
  DB session commit error in resource_registry.set_resources_dirty

Status in neutron:
  Fix Released

Bug description:
  It seems that patch
  https://review.opendev.org/c/openstack/neutron/+/805031 introduced a
  new error during the call to resource_registry.set_resources_dirty() in
  https://github.com/openstack/neutron/blob/6db261962894b1667dd213b116e89246a3e54386/neutron/api/v2/base.py#L506

  I didn't see that issue in our CI jobs on the master branch, but we noticed
  it in the downstream (d/s) jobs on OSP-16, which is based on Train. The
  error looks like:

  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource     six.reraise(self.type_, self.value, self.tb)
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource   File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource     raise value
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource   File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 142, in wrapper
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource     return f(*args, **kwargs)
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource   File "/usr/lib/python3.6/site-packages/neutron_lib/db/api.py", line 183, in wrapped
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource     LOG.debug("Retry wrapper got retriable exception: %s", e)
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource     self.force_reraise()
  2021-09-15 09:50:09.540 15 ERROR neutron.api.v2.resource   File "/usr/lib/

[Yahoo-eng-team] [Bug 1935842] Re: [OVN] Set inactivity_probe for NB/SB databases

2021-10-01 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/800521
Committed: 
https://opendev.org/openstack/neutron/commit/28f3017a90ecec6208aef9696cd7947972ec17d8
Submitter: "Zuul (22348)"
Branch: master

commit 28f3017a90ecec6208aef9696cd7947972ec17d8
Author: Rodolfo Alonso Hernandez 
Date:   Mon Jul 12 15:30:47 2021 +

[OVN] Set NB/SB "connection" inactivity probe

For each of the NB/SB databases, set the inactivity probe timeout
value on the connection register configured by Neutron.

Neutron connects to the NB/SB databases using the value defined in the
"ovn_nb_connection"/"ovn_sb_connection" config parameters, with the
format:
- tcp:IP:PORT
- ssl:IP:PORT
- unix:socket_file

The connection register "target" column has the following format:
- ptcp:PORT:IP
- pssl:PORT:IP
- punix:socket_file

Closes-Bug: #1935842
Change-Id: If82ff2c92b8bec18ae3e895ce19a65cfeed9c90d
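
The mapping between the two formats above is mechanical; a small sketch for
illustration only (the function name is made up, the transformation simply
follows the formats listed in the commit message):

    def connection_uri_to_target(uri):
        """Translate 'tcp:IP:PORT' / 'ssl:IP:PORT' / 'unix:socket_file'
        into the Connection table 'target' format
        'ptcp:PORT:IP' / 'pssl:PORT:IP' / 'punix:socket_file'."""
        proto, _, rest = uri.partition(":")
        if proto in ("tcp", "ssl"):
            # rest is "IP:PORT"; swap the order and prefix with 'p'.
            ip, _, port = rest.rpartition(":")
            return "p%s:%s:%s" % (proto, port, ip)
        if proto == "unix":
            return "punix:%s" % rest
        raise ValueError("unsupported OVN connection scheme: %s" % proto)

    # connection_uri_to_target("tcp:192.0.2.10:6641") -> "ptcp:6641:192.0.2.10"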


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1935842

Title:
  [OVN] Set inactivity_probe for NB/SB databases

Status in neutron:
  Fix Released

Bug description:
  The 'inactivity_probe' parameter in the OVN DB schemas [0][1] represents
  the maximum number of milliseconds of idle time on the connection to a
  client before an inactivity probe message is sent.

  This parameter is never set by Neutron, and it is important to allow
  its configuration, since at scale or under heavy load it can be crucial
  for avoiding reconnections.

  [0] 
https://github.com/ovn-org/ovn/blob/95e41218fd7ea9c288c95744e80ba7caa90b30e2/ovn-sb.ovsschema#L275
  [1] 
https://github.com/ovn-org/ovn/blob/95e41218fd7ea9c288c95744e80ba7caa90b30e2/ovn-nb.ovsschema#L456

  Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1979231

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1935842/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1937084] Re: Nova thinks deleted volume is still attached

2021-10-01 Thread Lee Yarwood
https://review.opendev.org/c/openstack/nova/+/812127 - Nova will address
the 404 from c-api during an attachment delete here.

** Also affects: nova
   Importance: Undecided
   Status: New

** Changed in: nova
   Status: New => In Progress

** Changed in: nova
   Importance: Undecided => Medium

** Changed in: nova
 Assignee: (unassigned) => Lee Yarwood (lyarwood)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1937084

Title:
  Nova thinks deleted volume is still attached

Status in Cinder:
  In Progress
Status in OpenStack Compute (nova):
  In Progress

Bug description:
  There are cases where a Cinder volume no longer exists yet Nova still
  thinks it is attached to an instance, and we cannot detach it anymore.

  This has been observed when running cinder-csi, which issues a volume
  delete request as soon as the volume status says it is available.

  This is a cinder race condition, and like most race conditions is not
  simple to explain.

  Some context on the issue:

  - Cinder API uses the volume "status" field as a locking mechanism to prevent 
concurrent request processing on the same volume.
  - Most cinder operations are asynchronous, so the API returns before the 
operation has been completed by the cinder-volume service, but the attachment 
operations such as creating/updating/deleting an attachment are synchronous, so 
the API only returns to the caller after the cinder-volume service has 
completed the operation.
  - Our current code **incorrectly** modifies the status of the volume on both 
the cinder-volume and the cinder-api services during the attachment delete 
operation.

  The actual sequence of events that leads to the issue reported in this
  BZ is:

  [Cinder-CSI]
  - Requests Nova to detach volume (Request R1)

  [Nova]
  - R1: Asks cinder-api to delete the attachment and **waits**

  [Cinder-API]
  - R1: Checks the status of the volume
  - R1: Sends terminate connection request (R1) to cinder-volume and **waits**

  [Cinder-Volume]
  - R1: Ask the driver to terminate the connection
  - R1: The driver asks the backend to unmap and unexport the volume
  - R1: The status of the volume is changed in the DB to "available"

  [Cinder-CSI]
  - Asks Cinder to delete the volume (Request R2)

  [Cinder-API]
  - R2: Check that the volume's status is valid.  It's available so it can be 
deleted.
  - R2: Tell cinder-volume to delete the volume and return immediately.

  [Cinder-Volume]
  - R2: Volume is deleted and DB entry is deleted
  - R1: Finish the termination of the connection

  [Cinder-API]
  - R1: Now that cinder-volume has finished the termination the code continues
  - R1: Try to modify the volume in the DB
  - R1: DB layer raises VolumeNotFound since the volume has been deleted from 
the DB
  - R1: VolumeNotFound is converted to HTTP 404 status code which is returned 
to Nova

  [Nova]
  - R1: Cinder responds with 404 on the attachment delete request
  - R1: Nova leaves the volume as attached, since the attachment delete failed

  At this point the Cinder and Nova DBs are out of sync, because Nova
  thinks that the attachment is connected and Cinder has detached the
  volume and even deleted it.

  **This is caused by a Cinder bug**, but there is some robustification
  work that could be done in Nova: since the os-brick call succeeded, the
  volume could be left in a "detached from instance" state, and a second
  detach request could skip the os-brick call entirely and, once it sees
  that the volume or the attachment no longer exists in Cinder, proceed to
  remove it from the instance's XML.
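
  A hedged sketch of that Nova-side robustification idea (the method name and
  the use of cinderclient's NotFound exception are assumptions for
  illustration, not Nova's actual code):

    # Sketch only. Assumes cinderclient.exceptions.NotFound maps the 404.
    from cinderclient import exceptions as cinder_exc

    def delete_attachment_best_effort(cinder, attachment_id):
        """Ask Cinder to delete a volume attachment.

        A 404 means the volume (or attachment) is already gone on the
        Cinder side, so the detach is treated as complete and Nova can
        carry on with its local cleanup (BDM removal, domain XML update).
        """
        try:
            cinder.attachments.delete(attachment_id)
        except cinder_exc.NotFound:
            # Nothing left to detach in Cinder; do not leave the volume
            # marked as attached in Nova.
            pass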

To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/1937084/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1813712] Re: [L2][scale issue] ovs-agent has multiple cookies flows (stale flows)

2021-10-01 Thread Slawek Kaplonski
** Changed in: neutron
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1813712

Title:
  [L2][scale issue] ovs-agent has multiple cookies flows (stale flows)

Status in neutron:
  Fix Released

Bug description:
  When the number of subnets or security group ports reaches 2000+, there are
  many stale flows.
  A basic failure sequence:
  (1) ovs-agent dumps flows
  (2) ovs-agent deletes some flows
  (3) ovs-agent installs new flows (with new cookies)
  (4) an exception is raised in (2) or (3), such as bug #1813705
  (5) ovs-agent does a full sync again, then goes back to (1)
  Eventually many stale flows are left installed, which can sometimes bring the
  data plane down.

  This is a subproblem of bug #1813703, for more information, please see the 
summary:
  https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813712/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1813704] Re: [L2][scale issue] RPC timeout during ovs-agent restart

2021-10-01 Thread Slawek Kaplonski
** Changed in: neutron
   Status: Confirmed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1813704

Title:
  [L2][scale issue] RPC timeout during ovs-agent restart

Status in neutron:
  Fix Released

Bug description:
  When the number of ports under one subnet or security group reaches 2000+, the
  ovs-agent will always hit RPC timeouts during restart.
  This is a subproblem of bug #1813703, for more information, please see the 
summary:
  https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813704/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1813705] Re: [L2][scale issue] local connection to ovs-vswitchd was drop or timeout

2021-10-01 Thread Slawek Kaplonski
** Changed in: neutron
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1813705

Title:
  [L2][scale issue] local connection to ovs-vswitchd was drop or timeout

Status in neutron:
  Fix Released

Bug description:
  When the number of subnets or security group ports reaches 2000+, the ovs-agent
  connection to ovs-vswitchd may be lost, dropped, or time out during restart.
  This is a subproblem of bug #1813703, for more information, please see the 
summary:
  https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813705/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1813707] Re: [L2][scale issue] ovs-agent restart costs too long time

2021-10-01 Thread Slawek Kaplonski
** Changed in: neutron
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1813707

Title:
  [L2][scale issue] ovs-agent restart costs too long time

Status in neutron:
  Fix Released

Bug description:
  When the number of subnets or security group ports reaches 2000+, the ovs-agent
  can take 15-40+ minutes to restart.
  During this restart time, the agent will not process any port, so a VM booting
  on this host will not get its L2 flows established.
  This is a subproblem of bug #1813703, for more information, please see the 
summary:
  https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813707/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1813706] Re: [L2][scale issue] ovs-agent failed to restart

2021-10-01 Thread Slawek Kaplonski
** Changed in: neutron
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1813706

Title:
  [L2][scale issue] ovs-agent failed to restart

Status in neutron:
  Fix Released

Bug description:
  When the number of subnets or security group ports reaches 2000+, the ovs-agent
  fails to restart and keeps doing a full sync indefinitely.
  This is a subproblem of bug #1813703, for more information, please see the 
summary:
  https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813706/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1813708] Re: [L2][scale issue] ovs-agent has too many flows to do trouble shooting

2021-10-01 Thread Slawek Kaplonski
** Changed in: neutron
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1813708

Title:
  [L2][scale issue] ovs-agent has too many flows to do trouble shooting

Status in neutron:
  Fix Released

Bug description:
  When the number of subnets or security group ports reaches 2000+, it is
  really hard to troubleshoot when one VM loses connectivity.
  The flow tables are almost unreadable (30k+ flows). We have no way to
  check the ovs-agent flow status, and restarting the L2 agent does not
  help anymore, since we have so many issues at this scale.

  This is a subproblem of bug #1813703, for more information, please see the 
summary:
  https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813708/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1813709] Re: [L2][scale issue] ovs-agent dump-flows takes a lot of time

2021-10-01 Thread Slawek Kaplonski
** Changed in: neutron
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1813709

Title:
  [L2][scale issue] ovs-agent dump-flows takes a lot of time

Status in neutron:
  Fix Released

Bug description:
  The ovs-agent stale-flow cleanup dumps all the bridge flows first. When the
  number of subnets or security group ports reaches 2000+, this becomes really
  time-consuming.
  Sometimes the dump action can also fail, and then the ovs-agent dumps again,
  and things get worse.
  This is a subproblem of bug #1813703, for more information, please see the 
summary:
  https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813709/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1945215] Re: "process_floating_ip_nat_rules_for_centralized_floatingip" should check if self.snat_iptables_manager was initialized

2021-10-01 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/811318
Committed: 
https://opendev.org/openstack/neutron/commit/f18edfdf450179f6bc8a47f3b143f2701bd93e0e
Submitter: "Zuul (22348)"
Branch: master

commit f18edfdf450179f6bc8a47f3b143f2701bd93e0e
Author: Rodolfo Alonso Hernandez 
Date:   Mon Sep 27 16:22:45 2021 +

[DVR] Check if SNAT iptables manager is initialized

Check if SNAT iptables manager is initialized before processing the
IP NAT rules. If the router never had an external GW port, the DVR
GW in the SNAT namespace has not been created and the SNAT iptables
manager has not been initialized.

In this case, the IP NAT rules for centralized FIPs (to be applied
on the SNAT namespace) cannot be set.

Closes-Bug: #1945215
Change-Id: I426602514805d728f8cd78e42f2b0979b2101089
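
A minimal sketch of the guard described above (names follow the bug report;
the body is illustrative, not the actual change in
neutron/agent/l3/dvr_edge_router.py):

    def process_floating_ip_nat_rules_for_centralized_floatingip(self):
        if self.snat_iptables_manager is None:
            # The router never had an external gateway port, so the SNAT
            # namespace and its iptables manager were never created; there
            # is nowhere to apply the centralized FIP NAT rules.
            # LOG is assumed to be the module-level logger.
            LOG.debug("SNAT iptables manager not initialized for router "
                      "%s, skipping centralized FIP NAT rules",
                      self.router_id)
            return
        # ... otherwise apply the NAT rules as before ...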


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1945215

Title:
  "process_floating_ip_nat_rules_for_centralized_floatingip" should
  check if self.snat_iptables_manager was initialized

Status in neutron:
  Fix Released

Bug description:
  Environment:
  L3 agent configuration: agent_mode=dvr_snat.
  The L3 agent is located on a controller node, acting as a DVR edge router (no
  HA).

  Description:
  When "process_floating_ip_nat_rules_for_centralized_floatingip" is called, 
this method should check first if "self.snat_iptables_manager" has been 
initialized. The method 
"process_floating_ip_nat_rules_for_centralized_floatingip" is called from:
    <-- DvrEdgeRouter.process_floating_ip_nat_rules
    <-- RouterInfo.process_snat_dnat_for_fip
    <-- RouterInfo.process_external

  The method "RouterInfo.process_external" will first call
  "RouterInfo._process_external_gateway" -->
  "DvrEdgeRouter.external_gateway_added" -->
  "DvrEdgeRouter._create_dvr_gateway". This last method initializes the
  SNAT iptables manager [1] (this code has been around unchanged six
  years).

  However "DvrEdgeRouter.external_gateway_added" is only called if
  "ex_gw_port" exists. That means if the GW port does not exist, the
  SNAT iptables manager is None.

  Error example (snippet): https://paste.opendev.org/show/809621/

  This bug is similar to https://bugs.launchpad.net/neutron/+bug/1560945
  (related patch:
  https://review.opendev.org/c/openstack/neutron/+/296394).

  Steps to Reproduce:
  (I'm not 100% sure, I still need to check) Create a FIP on a DVR SNAT router
  without a GW port.

  Bugzilla reference:
  https://bugzilla.redhat.com/show_bug.cgi?id=2008155

  
[1]https://github.com/openstack/neutron/blob/1d450dbddc8c3d34948ab3d9a8346dd491d9cc7c/neutron/agent/l3/dvr_edge_router.py#L196-L198

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1945215/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1945747] [NEW] GET security group rule is missing description attribute

2021-10-01 Thread Salvatore Orlando
Public bug reported:

The description attribute is missing from the dict built by
_make_security_group_rule_dict.

Create a security group rule with a description:

stack@bionic-template:~/devstack$ openstack security group rule create \
    --description "test rule" --remote-ip 0.0.0.0/0 --ingress \
    ff57f76f-93a0-4bf3-b538-c88df40fdc40
+-------------------+-------------------------------------------------+
| Field             | Value                                           |
+-------------------+-------------------------------------------------+
| created_at        | 2021-10-01T06:35:50Z                            |
| description       | test rule                                       |
| direction         | ingress                                         |
| ether_type        | IPv4                                            |
| id                | 389eb45e-58ac-471c-b966-a3c8784009f7            |
| location          | cloud='', project.domain_id='default',          |
|                   | project.domain_name=,                           |
|                   | project.id='f2527eb734c745eca32b1dfbd9107563',  |
|                   | project.name='admin', region_name='RegionOne',  |
|                   | zone=                                           |
| name              | None                                            |
| port_range_max    | None                                            |
| port_range_min    | None                                            |
| project_id        | f2527eb734c745eca32b1dfbd9107563                |
| protocol          | None                                            |
| remote_group_id   | None                                            |
| remote_ip_prefix  | None                                            |
| revision_number   | 0                                               |
| security_group_id | ff57f76f-93a0-4bf3-b538-c88df40fdc40            |
| tags              | []                                              |
| updated_at        | 2021-10-01T06:35:50Z                            |
+-------------------+-------------------------------------------------+


Example GET response (description missing):

RESP BODY: {"security_group_rule": {"id":
"389eb45e-58ac-471c-b966-a3c8784009f7", "tenant_id":
"f2527eb734c745eca32b1dfbd9107563", "security_group_id":
"ff57f76f-93a0-4bf3-b538-c88df40fdc40", "ethertype": "IPv4",
"direction": "ingress", "protocol": null, "port_range_min": null,
"port_range_max": null, "remote_ip_prefix": "0.0.0.0/0",
"remote_group_id": null, "local_ip_prefix": null, "created_at":
"2021-10-01T06:35:50Z", "updated_at": "2021-10-01T06:35:50Z",
"revision_number": 0, "project_id": "f2527eb734c745eca32b1dfbd9107563"}}

Potential fix (patch applies to stable/ussuri, not master)

diff --git a/neutron/db/securitygroups_db.py b/neutron/db/securitygroups_db.py
index 28238358ae..0c848bbe38 100644
--- a/neutron/db/securitygroups